Closed dpc closed 1 month ago
I've noticed zero speedup
because we have so many crates, all are already compiling in parallel.
if you compile one project then it will speed up.
like cargo b -p fedimint-cli
because we have so many crates, all are already compiling in parallel.
That's a rare use-case, a chunk of that time is already fast linking with mold, and then it will be like 2 seconds anyway.
@elsirion I remember you showing a pretty graph showing cpus were under-utilized and mentioning lots of cores. Do you happen to have a machine like that? BTW. What was that tool that generated that pretty graph?
What was that tool that generated that pretty graph?
cargo build --timings maybe?
What was that tool that generated that pretty graph?
cargo --timings maybe?
See my edit above. I don't think that's it.
that's nix run nixpkgs#btop
also we need to bench cargo check
instead of cargo build
because that is only runs frontend (which was parallelized in rustc multithread)
No extra threads:
Full check debug: 34.08 458.20 41.41
Incr check debug: 2.84 3.57 2.17
Full build debug: 62.37 1248.81 90.31
Incr build debug: 9.20 16.95 8.51
Full check release: 31.61 258.52 35.97
Incr check release: 9.30 14.43 2.28
Full build release: 84.01 1726.18 80.97
Incr build release: 43.24 990.00 27.75
THREADS=8
Full check debug: 31.59 524.34 56.68
Incr check debug: 3.44 6.48 5.62
Full build debug: 60.95 1378.73 116.51
Incr build debug: 9.54 22.44 12.21
Full check release: 31.79 325.78 52.02
Incr check release: 5.94 23.59 3.47
Full build release: 82.33 1820.43 105.97
Incr build release: 42.83 1018.34 32.54
THREADS=20
Full check debug: 32.78 525.24 56.50
Incr check debug: 3.49 6.55 5.63
Full build debug: 62.24 1365.69 114.37
Incr build debug: 9.00 22.19 11.84
Full check release: 31.29 319.97 52.68
Incr check release: 5.90 23.14 3.44
Full build release: 82.45 1836.50 110.86
Incr build release: 42.22 1031.65 32.38
Not going to fix the padding, too much work. :D
Notably when threads are enabled the sys
increases (I guess caused by whatever the parallelism overhead), which makes what's already small and fast suffer a bit, canceling improvements.
I guess on system that have more, but slower cores the gain might be better(?).
Master:
CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz
Date: 2024-03-11
Commit: 24e8e3a78f
total user sys
Full check debug: 51.72 599.06 48.98
Incr check debug: 3.75 5.32 2.81
Full build debug: 94.17 1582.06 106.13
Incr build debug: 13.05 24.85 10.57
Full check release: 37.83 324.82 44.15
Incr check release: 12.90 20.33 3.12
Full build release: 118.29 2431.49 95.96
Incr build release: 74.97 1463.55 35.07
This branch:
CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz
Date: 2024-03-11
Commit: d66c68f708
total user sys
Full check debug: 46.99 708.62 72.70
Incr check debug: 4.42 8.32 6.47
Full build debug: 101.06 1876.18 152.78
Incr build debug: 12.87 29.63 14.44
Full check release: 46.30 438.94 68.37
Incr check release: 7.82 30.79 4.42
Full build release: 133.48 2511.73 138.47
Incr build release: 70.44 1380.19 45.76
So this branch is slower in most stuff.
I see a slight improvement in release mode, but might be noise, debug is slightly slower:
Full check debug: 49.10 968.92 71.50
Incr check debug: 6.59 15.41 8.99
Full build debug: 84.02 2480.56 149.89
Incr build debug: 12.49 40.47 16.98
Full check release: 53.20 582.53 65.35
Incr check release: 10.51 44.11 3.95
Full build release: 114.46 3019.66 142.44
Incr build release: 53.65 1587.94 36.96
I was going to close it, but maybe I could try https://www.reddit.com/r/rust/comments/1bgyo8a/try_cranelift_codegen_backend_for_faster_compile/ while at it.
Nice!
Before cranelift:
total user sys
Full check debug: 27.67 488.49 50.46
Incr check debug: 3.53 7.27 6.07
Full build debug: 56.31 1245.94 101.50
Incr build debug: 10.13 25.44 14.60
After cranelift:
total user sys
Full check debug: 20.87 275.16 44.34
Incr check debug: 3.77 7.64 5.98
Full build debug: 29.92 395.84 65.06
Incr build debug: 7.75 23.76 11.82
Before:
Commit: be2815f684
total user sys
Full check debug: 50.83 782.93 51.03
Incr check debug: 4.77 7.67 2.82
Full build debug: 76.18 1951.05 107.73
Incr build debug: 11.91 29.63 10.66
After:
Commit: 0011b5374d
total user sys
Full check debug: 28.99 510.35 59.83
Incr check debug: 7.23 17.43 9.65
Full build debug: 39.40 719.91 83.11
Incr build debug: 11.61 43.16 15.26
I like! :heart_eyes:
It's green. It's only enabled in dev builds, so IMO low risk. Let's give it a spin!
Bummer.
> cat target-nix/devimint/logs/fedimintd-0.log
2024-03-18T22:05:11.213297Z INFO consensus: Starting config gen
2024-03-18T22:05:11.213685Z INFO net::peer::dkg: Created new config gen Api
2024-03-18T22:05:11.215604Z INFO net::api: Starting api on ws://127.0.0.1:12969
2024-03-18T22:05:11.215874Z INFO fedimintd::fedimintd: Metrics API listening on 127.0.0.1:12976
trap at Instance { def: Item(DefId(1:15511 ~ core[0bd7]::core_arch::x86::sha::_mm_sha1rnds4_epu32)), args: [0_i32] } (_ZN4core9core_arch3x86
3sha19_mm_sha1rnds4_epu3217h09747cb42999c042E): llvm.x86.sha1rnds4
OK, so this can be fixed by updating nightly. :four_leaf_clover:
But now the problem is that dev builds and tests are much slower. 5x or so. I don't think anyone wants to wait 1-2 minutes to open just mprocs
:D
But now the problem is that dev builds and tests are much slower. 5x or so.
How so?
How so?
That's the whole thing with cranelift - it is supposed to be faster at the expense of optimization power. It only have basic optimizations currently, maybe it will get somewhat better in the future.
In our case it looks like consensus is getting extremely slow. Possibly just some crypto suffers from lack of certain optimization? Hard to tell. I tried increasing optimization level on certain things, but without success.
But now the problem is that dev builds and tests are much slower. 5x or so.
For cryptography related crates this project is using opt-level = 3: https://github.com/fedimint/fedimint/blob/29cf90e30a5c1d1b87625ef6bc435d8533828f17/Cargo.toml#L107-L125 I never quite benchmarked it, but I did expect cg_clif with optimizations enabled to be around the speed of LLVM opt-level = 1. It is in principle possible to compile these crates with LLVM by adding codegen-backend = "llvm"
to each of these package profile overrides (so for example secp256k1 = { opt-level = 3, codegen-backend = "llvm" }
.) Be aware however that there is currently an ABI incompatibility around 128bit integers: https://github.com/rust-lang/rustc_codegen_cranelift/issues/1449 I haven't gotten around fixing it properly yet.
ps: This is the project you mentioned in https://lobste.rs/s/tjh7oy/cranelift_code_generation_comes_rust#c_gq7fp3, right?
Edit: Opened https://github.com/rust-lang/rustc_codegen_cranelift/issues/1468 to track emitting a warning for this case to reduce confusion in the future.
@bjorn3 Oh, it's already possible to pick these per-crate? Awesome. I'll try it out and report back. Thanks for letting us know, and I'll sub these issues and keep an eye on it.
Did another round of testing, and pushed a best approach
Current master branch:
total user sys
Full check debug: 34.16 420.65 39.60
Incr check debug: 2.86 3.98 2.55
Full build debug: 61.04 1138.98 84.82
Incr build debug: 9.71 18.53 9.47
after with cranelift + parallel (but now with llvm for perf-heavy crates):
total user sys
Full check debug: 29.46 342.61 107.06
Incr check debug: 4.96 11.67 11.06
Full build debug: 40.04 487.93 126.50
Incr build debug: 8.91 29.45 17.05
just cranelift:
total user sys
Full check debug: 27.74 216.44 36.03
Incr check debug: 3.37 4.87 2.77
Full build debug: 38.91 336.45 54.94
Incr build debug: 7.87 17.36 7.19
no -O3 on local crypto crates (fedimint-threshold-crypto) + cranelift:
total user sys
Full check debug: 28.19 215.80 35.18
Incr check debug: 3.34 4.92 2.70
Full build debug: 39.73 335.34 54.93
Incr build debug: 8.69 17.95 7.95
O3 -> O2 + above:
total user sys
Full check debug: 28.17 212.10 34.99
Incr check debug: 3.32 4.85 2.73
Full build debug: 38.70 332.25 55.34
Incr build debug: 8.25 17.47 7.72
With this setup the env RUST_LOG=devimint=debug just devimint-env
starts as fast as before, so seems like a pure win now.
When I checkout the branch and direnv runs, the nix environment doesn't build.
$ git fetch dpc
$ git checkout go-nightly
branch 'go-nightly' set up to track 'dpc/go-nightly' by rebasing.
Switched to a new branch 'go-nightly'
[1 copied (0.3 MiB), 0.1 MiB DL] querying source on https://cache.nixos.orgdirenv: ([/nix/store/wrnrvg7ix2bmgnm3v4qvgvipgivak2ym-direnv-2.32.3/bin/direnv export fish]) is taking a while to execute. Use CTRL-C to give up.
error:
… while calling the 'derivationStrict' builtin
at /builtin/derivation.nix:9:12: (source not available)
… while evaluating derivation 'nix-shell'
whose name attribute is located at /nix/store/3i3rncs75fid9hwai5p2nvwc4ngdnia7-source/pkgs/stdenv/generic/make-derivation.nix:348:7
… while evaluating attribute 'nativeBuildInputs' of derivation 'nix-shell'
at /nix/store/3i3rncs75fid9hwai5p2nvwc4ngdnia7-source/pkgs/stdenv/generic/make-derivation.nix:392:7:
391| depsBuildBuild = elemAt (elemAt dependencies 0) 0;
392| nativeBuildInputs = elemAt (elemAt dependencies 0) 1;
| ^
393| depsBuildTarget = elemAt (elemAt dependencies 0) 2;
(stack trace truncated; use '--show-trace' to show the full trace)
error: attribute 'rustc-codegen-cranelift-preview' missing
at /nix/store/6nf10v3z293qh6qy9rlm7rcvzrz4i6m1-source/lib/mkFenixToolchain.nix:52:21:
51| (map
52| (component: fenix.packages.${system}.${channel}.${component})
| ^
53| components
OH. Of course. Great. I could make this component conditional, but then we would need to make the backend setting conditional on the current system, in Cargo.toml. And that is probably possibly only via build.rs setting up some feature flag, at which point it doesn't seem all that worthwhile.
Close the PR then for now?
can we have rustc wrapper that ignores the codegen-backend arguments on macos
61s => 40s on a fast system will be insane speed for slow systems.
can we have rustc wrapper that ignores the codegen-backend arguments on macos
There's only so much complexity we want to deal with.
In a couple of months MacOS support might be already implemented, and we can try-again.
Not planning to land it right now.
On top of pending PRs, due to possibility of merge conflicts.
See https://blog.rust-lang.org/2023/11/09/parallel-rustc.html
On my machine I've noticed zero speedup. It might speed up someone's compilation time, but I guess it would require lots of cpu cores. (more than 20 I have?).
Use
just bench-compilation
to bench. If anyone sees a speedup and it seems beneficial, we could land. Otherwise I'm a bit dissapointed.