Closed trueNAHO closed 3 months ago
Won't this have a negative impact on evaluation time?
(Slightly relevant - Nix has a built-in profiler which can measure this by counting function calls, memory use, etc.)
> Won't this have a negative impact on evaluation time?
Generally, I assume that generating large strings on the CPU is faster than reading large strings from the filesystem, at least in non-compiled languages. Nonetheless, similar to compiled languages, I would prioritize build time (when all the Bash code runs) over Nix evaluation time, because one derivation may be built several times while being evaluated only once.
> (Slightly relevant - Nix has a built-in profiler which can measure this by counting function calls, memory use, etc.)
Somehow I am unaware of such a magnificent built-in functionality. A quick search yields:
Do you mind pointing me in the right direction for me to discover this built-in profiler? Then I could profile both implementations.
It's enabled by setting these environment variables:

NIX_SHOW_STATS=1
NIX_SHOW_STATS_PATH=stats.json

You can also add NIX_COUNT_CALLS=1 to track individual function calls.

Run this before and after the change with different output files, then use diff to easily see what changed.
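As a concrete sketch of that workflow, the snippet below fakes two stats files (the cpuTime values are copied from the benchmark results further down; real files would be produced via NIX_SHOW_STATS_PATH) and computes the delta:

```shell
# Sketch: compare the cpuTime of two NIX_SHOW_STATS runs.
# These files are hand-written stand-ins; real ones would come from e.g.
#   NIX_SHOW_STATS=1 NIX_SHOW_STATS_PATH=before.json nix build ...
cat > before.json <<'EOF'
{ "cpuTime": 0.008786000311374664 }
EOF
cat > after.json <<'EOF'
{ "cpuTime": 0.022556999698281288 }
EOF

extract() {
  # Pull the numeric cpuTime value out of a stats file.
  sed -n 's/.*"cpuTime": \([0-9.]*\).*/\1/p' "$1"
}

delta=$(awk -v a="$(extract before.json)" -v b="$(extract after.json)" \
  'BEGIN { printf "%+f", b - a }')
printf 'cpuTime delta: %s\n' "$delta"
rm -f before.json after.json
```

A full stats file has many more fields (as shown below), so diffing the whole JSON is still useful for spotting changes in thunk counts, set sizes, and so on.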
Currently, I have a pretty weird Home Manager setup, and I am not entirely sure how representative the following benchmark is. It merely enables Stylix without any further Home Manager configuration. Note that I ran each version only once instead of multiple times to compute an average.
If my benchmark has any real-world value, then unrolling the loop increases the cpuTime by 0.013771, which seems fairly negligible. However, I have not benchmarked how much faster the Bash code execution is.
To reduce the cpuTime again, we could hard-code the Nix-unrolled loop, although it would somewhat reduce code quality. What should we do?
For reference, here are the entire benchmark results:
upstream/master
{
"attributes": null,
"cpuTime": 0.008786000311374664,
"envs": {
"bytes": 0,
"elements": 0,
"number": 0
},
"functions": [],
"gc": {
"heapSize": 402915328,
"totalBytes": 8256
},
"list": {
"bytes": 24,
"concats": 0,
"elements": 3
},
"nrAvoided": 4,
"nrFunctionCalls": 0,
"nrLookups": 2,
"nrOpUpdateValuesCopied": 0,
"nrOpUpdates": 0,
"nrPrimOpCalls": 1,
"nrThunks": 2,
"primops": {
"import": 1
},
"sets": {
"bytes": 2320,
"elements": 139,
"number": 6
},
"sizes": {
"Attr": 16,
"Bindings": 16,
"Env": 16,
"Value": 24
},
"symbols": {
"bytes": 2335,
"number": 254
},
"values": {
"bytes": 3024,
"number": 126
}
}
trueNAHO/perf-modules-kde-hm-loop-unroll-activation-script
{
"attributes": null,
"cpuTime": 0.022556999698281288,
"envs": {
"bytes": 0,
"elements": 0,
"number": 0
},
"functions": [],
"gc": {
"heapSize": 402915328,
"totalBytes": 8256
},
"list": {
"bytes": 24,
"concats": 0,
"elements": 3
},
"nrAvoided": 4,
"nrFunctionCalls": 0,
"nrLookups": 2,
"nrOpUpdateValuesCopied": 0,
"nrOpUpdates": 0,
"nrPrimOpCalls": 1,
"nrThunks": 2,
"primops": {
"import": 1
},
"sets": {
"bytes": 2320,
"elements": 139,
"number": 6
},
"sizes": {
"Attr": 16,
"Bindings": 16,
"Env": 16,
"Value": 24
},
"symbols": {
"bytes": 2335,
"number": 254
},
"values": {
"bytes": 3024,
"number": 126
}
}
diff upstream/master trueNAHO/perf-modules-kde-hm-loop-unroll-activation-script
3c3
< "cpuTime": 0.008786000311374664,
---
> "cpuTime": 0.022556999698281288,
> What should we do?
We should first improve the benchmark results.
As expected, the loop unrolling implementation outperforms the for-loop implementation.
The loop unrolling implementation outperforms the for-loop implementation in the extreme cases ("Maximum" and "Minimum"), with negligible "Mean" and "Median" differences:
| Implementation | Maximum | Minimum | Mean | Median |
|---|---|---|---|---|
| No Loop Unrolling | 0.029216999188066 | 0.0060549997724593 | 0.012693397997413 | 0.010975999757648 |
| Loop Unrolling | 0.028155000880361 | 0.0048170001246035 | 0.013636807999574 | 0.011239500250667 |
For reference, a plot of the 1000 home-manager build benchmark runs (image not included here) shows some spikes, presumably due to unrelated OS operations.
Note: I ran the benchmarks on a cold boot from the TTY with no other user processes.
Loop unrolling slightly improves performance of the generated Bash code. However, the benchmark results vary greatly considering that each iteration runs only for a few milliseconds.
For reference, run the benchmarks with the following commands and the accompanying Nix-generated files.
loop_unrolling_match.sh
wallpaperImage="$([[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; })"
lookAndFeel="$([[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; }; [[ -f "/etc/passwd" ]] && { printf '%s\n' "/etc/passwd"; exit 0; })"
loop_unrolling_no_match.sh
wallpaperImage="$([[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; })"
lookAndFeel="$([[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; }; [[ -f "/dev/null" ]] && { printf '%s\n' "/dev/null"; exit 0; })"
no_loop_unrolling_match.sh
globalPath() {
for dir in /etc /etc /etc; do
if [[ -f "$dir/$1" ]]; then
printf '%s\n' "$dir/$1"
break
fi
done
}
wallpaperImage="$(globalPath passwd)"
lookAndFeel="$(globalPath passwd)"
no_loop_unrolling_no_match.sh
globalPath() {
for dir in /dev /dev /dev; do
if [[ -f "$dir/$1" ]]; then
printf '%s\n' "$dir/$1"
break
fi
done
}
wallpaperImage="$(globalPath null)"
lookAndFeel="$(globalPath null)"
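To sanity-check that the two variants are behaviorally equivalent, here is a minimal sketch that runs both lookup styles against the same temporary tree. The directory layout and file name are made up for illustration; they are not Stylix's real search paths.

```shell
# Create a throwaway search path where only the second directory matches.
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b" "$tmp/c"
printf 'x' > "$tmp/b/wallpaper.png"

# Looped lookup, as in no_loop_unrolling_match.sh.
globalPath() {
  for dir in "$tmp/a" "$tmp/b" "$tmp/c"; do
    if [ -f "$dir/$1" ]; then
      printf '%s\n' "$dir/$1"
      break
    fi
  done
}
looped=$(globalPath wallpaper.png)

# Unrolled lookup, as in loop_unrolling_match.sh: each iteration is emitted
# literally, and `exit 0` leaves the command-substitution subshell on a match.
unrolled=$(
  [ -f "$tmp/a/wallpaper.png" ] && { printf '%s\n' "$tmp/a/wallpaper.png"; exit 0; }
  [ -f "$tmp/b/wallpaper.png" ] && { printf '%s\n' "$tmp/b/wallpaper.png"; exit 0; }
  [ -f "$tmp/c/wallpaper.png" ] && { printf '%s\n' "$tmp/c/wallpaper.png"; exit 0; }
)

if [ "$looped" = "$unrolled" ]; then agree=yes; else agree=no; fi
printf 'variants agree: %s\n' "$agree"
rm -rf "$tmp"
```

Both lookups should resolve to the same path (the file in the second directory), which is what the benchmarks above rely on.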
printf \
'%s: %s\n' \
"'sudo' permission is required to drop caches" \
"https://github.com/sharkdp/hyperfine/blob/c0921351e7eb9814fc829dbf48a5c745c4b0ad91/README.md#warmup-runs-and-preparation-commands";
sudo --validate &&
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--prepare "sh -c 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches'" \
--runs 3000 \
--shell=none \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh' \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh';
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--prepare "sh -c 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches'" \
--runs 3000 \
--shell=none \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh' \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh';
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--warmup 1000 \
--runs 3000 \
--shell=none \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh' \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh';
nix run nixpkgs#hyperfine -- \
--ignore-failure \
--warmup 1000 \
--runs 3000 \
--shell=none \
'bash no_loop_unrolling_match.sh' \
'bash no_loop_unrolling_no_match.sh' \
'bash loop_unrolling_match.sh' \
'bash loop_unrolling_no_match.sh'
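hyperfine can also export machine-readable results via --export-json, which makes later comparison easier than eyeballing terminal output. Below is a sketch using a hand-written stand-in for the real export file; the results[].command and results[].mean fields follow hyperfine's JSON schema, and the numbers are illustrative, not measured:

```shell
# Stand-in for the output of:
#   hyperfine --export-json results.json 'bash loop_unrolling_match.sh' ...
cat > results.json <<'EOF'
{"results": [
{"command": "bash loop_unrolling_match.sh", "mean": 0.0136},
{"command": "bash no_loop_unrolling_match.sh", "mean": 0.0127}
]}
EOF

# Print one "command mean" pair per line.
pairs=$(sed -n 's/.*"command": "\([^"]*\)", "mean": \([0-9.]*\).*/\1 \2/p' results.json)
printf '%s\n' "$pairs"
rm -f results.json
```

The exported file also contains median, min, max, and per-run times, which is how the summary table above could be reproduced without manual transcription.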
IMHO, the performance improvement is practically non-existent, and the implementation should be picked based on source-code maintainability.
Personally, I find both implementations equally readable and maintainable.
Nonetheless, I would assume that the loop unrolling implementation marginally reduces peak memory usage of Nix because the literal string is generated on demand instead of loaded all at once. However, I am not sure how correct this statement is in a lazy language.
> Personally, both implementations are equally readable and maintainable.
https://github.com/danth/stylix/pull/314 is more readable and maintainable.
> Nonetheless, I would assume that the loop unrolling implementation marginally reduces peak memory usage of Nix because the literal string is generated on demand instead of loaded all at once. However, I am not sure how correct this statement is in a lazy language.
With the loop unrolling in Nix, I believe it will generate each iteration as a string, then another string with all iterations concatenated together, and finally combine this with the rest of the script. Since Nix is garbage-collected, each stage may remain in memory longer than required, therefore actually increasing peak memory usage compared to hard-coding the final script.
And I agree, maintainability should be prioritized over a very minor difference in performance.
If you're looking to improve the performance of Stylix, I would recommend focusing on more impactful changes such as reducing import-from-derivation usage (which will improve evaluation time) or reducing the number of package rebuilds we cause (which will greatly improve build time).
Improve the run time of the generated Bash code by unrolling the loop at Nix evaluation time.
Additionally, this change replaces the `echo` command from the original loop with `printf '%s\n'` to align with Bash best practices.
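For context on the `echo` vs `printf` point: `echo` misparses some values as options, while `printf '%s\n'` prints its argument verbatim. A minimal illustration, using the contrived value "-n":

```shell
value='-n'

# `echo "$value"` would treat "-n" as the suppress-newline option in most
# shells and print nothing, so generated scripts cannot rely on it.
# `printf '%s\n'` prints the argument verbatim:
printed=$(printf '%s\n' "$value")
printf 'printf output: %s\n' "$printed"
```

This is why POSIX recommends `printf` over `echo` whenever the argument is not a fixed, option-free string.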