Closed trueNAHO closed 3 months ago
Won't this have a negative impact on evaluation time?
(Slightly relevant - Nix has a built-in profiler which can measure this by counting function calls, memory use, etc.)
> Won't this have a negative impact on evaluation time?
Generally, I assume that generating large strings on the CPU is faster than reading large strings from the filesystem, at least in non-compiled languages. Nonetheless, similar to compiled languages, I would prioritize build time (when all the Bash code runs) over Nix evaluation time, because one derivation may be built several times while being evaluated only once.
> (Slightly relevant - Nix has a built-in profiler which can measure this by counting function calls, memory use, etc.)
Somehow I am unaware of such a magnificent built-in functionality. A quick search yields:
Do you mind pointing me in the right direction for me to discover this built-in profiler? Then I could profile both implementations.
It's enabled by setting these environment variables:

NIX_SHOW_STATS=1
NIX_SHOW_STATS_PATH=stats.json

You can also add NIX_COUNT_CALLS=1 to track individual function calls.

Run this before and after the change with different output files, then use diff to easily see what changed.
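As a concrete sketch of that workflow, the snippet below fakes two stats files (the cpuTime values are copied from the benchmark results further down; real files would be produced via NIX_SHOW_STATS_PATH) and computes the delta:

```shell
# Sketch: compare the cpuTime of two NIX_SHOW_STATS runs.
# These files are hand-written stand-ins; real ones would come from e.g.
#   NIX_SHOW_STATS=1 NIX_SHOW_STATS_PATH=before.json nix build ...
cat > before.json <<'EOF'
{ "cpuTime": 0.008786000311374664 }
EOF
cat > after.json <<'EOF'
{ "cpuTime": 0.022556999698281288 }
EOF

extract() {
  # Pull the numeric cpuTime value out of a stats file.
  sed -n 's/.*"cpuTime": \([0-9.]*\).*/\1/p' "$1"
}

delta=$(awk -v a="$(extract before.json)" -v b="$(extract after.json)" \
  'BEGIN { printf "%+f", b - a }')
printf 'cpuTime delta: %s\n' "$delta"
rm -f before.json after.json
```

A full stats file has many more fields (as shown below), so diffing the whole JSON is still useful for spotting changes in thunk counts, set sizes, and so on.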
Currently, I have a pretty weird Home Manager setup, and I am not entirely sure how representative the following benchmark is. It merely enables Stylix without any further Home Manager configuration. Note that I ran each version only once instead of multiple times to compute an average.
If my benchmark has any real-world value, then unrolling the loop increases the cpuTime by 0.013771, which seems fairly negligible. However, I have not benchmarked how much faster the Bash code execution is.
To reduce the cpuTime again, we could hard-code the Nix-unrolled loop, although it would somewhat reduce code quality. What should we do?
For reference, here are the entire benchmark results:
upstream/master
{
"attributes": null,
"cpuTime": 0.008786000311374664,
"envs": {
"bytes": 0,
"elements": 0,
"number": 0
},
"functions": [],
"gc": {
"heapSize": 402915328,
"totalBytes": 8256
},
"list": {
"bytes": 24,
"concats": 0,
"elements": 3
},
"nrAvoided": 4,
"nrFunctionCalls": 0,
"nrLookups": 2,
"nrOpUpdateValuesCopied": 0,
"nrOpUpdates": 0,
"nrPrimOpCalls": 1,
"nrThunks": 2,
"primops": {
"import": 1
},
"sets": {
"bytes": 2320,
"elements": 139,
"number": 6
},
"sizes": {
"Attr": 16,
"Bindings": 16,
"Env": 16,
"Value": 24
},
"symbols": {
"bytes": 2335,
"number": 254
},
"values": {
"bytes": 3024,
"number": 126
}
}
trueNAHO/perf-modules-kde-hm-loop-unroll-activation-script
{
"attributes": null,
"cpuTime": 0.022556999698281288,
"envs": {
"bytes": 0,
"elements": 0,
"number": 0
},
"functions": [],
"gc": {
"heapSize": 402915328,
"totalBytes": 8256
},
"list": {
"bytes": 24,
"concats": 0,
"elements": 3
},
"nrAvoided": 4,
"nrFunctionCalls": 0,
"nrLookups": 2,
"nrOpUpdateValuesCopied": 0,
"nrOpUpdates": 0,
"nrPrimOpCalls": 1,
"nrThunks": 2,
"primops": {
"import": 1
},
"sets": {
"bytes": 2320,
"elements": 139,
"number": 6
},
"sizes": {
"Attr": 16,
"Bindings": 16,
"Env": 16,
"Value": 24
},
"symbols": {
"bytes": 2335,
"number": 254
},
"values": {
"bytes": 3024,
"number": 126
}
}
diff upstream/master trueNAHO/perf-modules-kde-hm-loop-unroll-activation-script
3c3
< "cpuTime": 0.008786000311374664,
---
> "cpuTime": 0.022556999698281288,
> What should we do?
We should first improve the benchmark results.
As expected, the loop unrolling implementation outperforms the for-loop implementation.
The loop unrolling implementation outperforms the for-loop implementation in the extreme cases ("Maximum" and "Minimum"), with negligible "Mean" and "Median" differences:
| Implementation | Maximum | Minimum | Mean | Median |
|---|---|---|---|---|
| No Loop Unrolling | 0.029216999188066 | 0.0060549997724593 | 0.012693397997413 | 0.010975999757648 |
| Loop Unrolling | 0.028155000880361 | 0.0048170001246035 | 0.013636807999574 | 0.011239500250667 |
For reference, a plot of the 1000 home-manager build benchmark runs (image not included here) shows some spikes, presumably due to unrelated OS operations.
Note: I ran the benchmarks on a cold boot from the TTY with no other user processes.
Loop unrolling slightly improves performance of the generated Bash code. However, the benchmark results vary greatly considering that each iteration runs only for a few milliseconds.
For reference, run the benchmarks with the following commands and the accompanying Nix-generated files.
loop_unrolling_match.sh
wallpaperImage="$([[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; })"
lookAndFeel="$([[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; })"
loop_unrolling_no_match.sh
wallpaperImage="$([[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; })"
lookAndFeel="$([[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; })"
no_loop_unrolling_match.sh
globalPath() {
for dir in /etc /etc /etc; do
if [[ -f "$dir/$1" ]]; then
printf '%s\n' "$dir/$1"
break
fi
done
}
wallpaperImage="$(globalPath passwd)"
lookAndFeel="$(globalPath passwd)"
no_loop_unrolling_no_match.sh
globalPath() {
for dir in /dev /dev /dev; do
if [[ -f "$dir/$1" ]]; then
printf '%s\n' "$dir/$1"
break
fi
done
}
wallpaperImage="$(globalPath null)"
lookAndFeel="$(globalPath null)"
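To sanity-check that the two variants are behaviorally equivalent, here is a minimal sketch that runs both lookup styles against the same temporary tree. The directory layout and file name are made up for illustration; they are not Stylix's real search paths.

```shell
# Create a throwaway search path where only the second directory matches.
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b" "$tmp/c"
printf 'x' > "$tmp/b/wallpaper.png"

# Looped lookup, as in no_loop_unrolling_match.sh.
globalPath() {
  for dir in "$tmp/a" "$tmp/b" "$tmp/c"; do
    if [ -f "$dir/$1" ]; then
      printf '%s\n' "$dir/$1"
      break
    fi
  done
}
looped=$(globalPath wallpaper.png)

# Unrolled lookup, as in loop_unrolling_match.sh: each iteration is emitted
# literally, and `exit 0` leaves the command-substitution subshell on a match.
unrolled=$(
  [ -f "$tmp/a/wallpaper.png" ] && { printf '%s\n' "$tmp/a/wallpaper.png"; exit 0; }
  [ -f "$tmp/b/wallpaper.png" ] && { printf '%s\n' "$tmp/b/wallpaper.png"; exit 0; }
  [ -f "$tmp/c/wallpaper.png" ] && { printf '%s\n' "$tmp/c/wallpaper.png"; exit 0; }
)

if [ "$looped" = "$unrolled" ]; then agree=yes; else agree=no; fi
printf 'variants agree: %s\n' "$agree"
rm -rf "$tmp"
```

Both lookups should resolve to the same path (the file in the second directory), which is what the benchmarks above rely on.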
printf \
'%s: %s\n' \
"'sudo' permission is required to drop caches" \
"https://github.com/sharkdp/hyperfine/blob/c0921351e7eb9814fc829dbf48a5c745c4b0ad91/README.md#warmup-runs-and-preparation-commands";
sudo --validate &&
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--prepare "sh -c 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches'" \
--runs 3000 \
--shell=none \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh' \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh';
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--prepare "sh -c 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches'" \
--runs 3000 \
--shell=none \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh' \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh';
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--warmup 1000 \
--runs 3000 \
--shell=none \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh' \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh';
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--warmup 1000 \
--runs 3000 \
--shell=none \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh' \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh'
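hyperfine can also export machine-readable results via --export-json, which makes later comparison easier than eyeballing terminal output. Below is a sketch using a hand-written stand-in for the real export file; the results[].command and results[].mean fields follow hyperfine's JSON schema, and the numbers are illustrative, not measured:

```shell
# Stand-in for the output of:
#   hyperfine --export-json results.json 'bash loop_unrolling_match.sh' ...
cat > results.json <<'EOF'
{"results": [
{"command": "bash loop_unrolling_match.sh", "mean": 0.0136},
{"command": "bash no_loop_unrolling_match.sh", "mean": 0.0127}
]}
EOF

# Print one "command mean" pair per line.
pairs=$(sed -n 's/.*"command": "\([^"]*\)", "mean": \([0-9.]*\).*/\1 \2/p' results.json)
printf '%s\n' "$pairs"
rm -f results.json
```

The exported file also contains median, min, max, and per-run times, which is how the summary table above could be reproduced without manual transcription.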
IMHO, the performance improvement is practically non-existent, and the implementation should be picked based on source-code maintainability.
Personally, I find both implementations equally readable and maintainable.
Nonetheless, I would assume that the loop unrolling implementation marginally reduces peak memory usage of Nix because the literal string is generated on demand instead of loaded all at once. However, I am not sure how correct this statement is in a lazy language.
> Personally, both implementations are equally readable and maintainable.
https://github.com/danth/stylix/pull/314 is more readable and maintainable.
> Nonetheless, I would assume that the loop unrolling implementation marginally reduces peak memory usage of Nix because the literal string is generated on demand instead of loaded all at once. However, I am not sure how correct this statement is in a lazy language.
With the loop unrolling in Nix, I believe it will generate each iteration as a string, then another string with all iterations concatenated together, and finally combine this with the rest of the script. Since Nix is garbage-collected, each stage may remain in memory longer than required, therefore actually increasing peak memory usage compared to hard-coding the final script.
And I agree, maintainability should be prioritized over a very minor difference in performance.
If you're looking to improve the performance of Stylix, I would recommend focusing on more impactful changes such as reducing import-from-derivation usage (which will improve evaluation time) or reducing the number of package rebuilds we cause (which will greatly improve build time).
Improve the run time of the generated Bash code by unrolling the loop at Nix evaluation time.
Additionally, this change replaces the `echo` command from the original loop with `printf '%s\n'` to align with Bash best practices.
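For context on the `echo` vs `printf` point: `echo` misparses some values as options, while `printf '%s\n'` prints its argument verbatim. A minimal illustration, using the contrived value "-n":

```shell
value='-n'

# `echo "$value"` would treat "-n" as the suppress-newline option in most
# shells and print nothing, so generated scripts cannot rely on it.
# `printf '%s\n'` prints the argument verbatim:
printed=$(printf '%s\n' "$value")
printf 'printf output: %s\n' "$printed"
```

This is why POSIX recommends `printf` over `echo` whenever the argument is not a fixed, option-free string.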