chore: enable cranelift (switch to nightly)

dpc commented 2 months ago

Not planning to land it right now.

On top of pending PRs, due to possibility of merge conflicts.

See https://blog.rust-lang.org/2023/11/09/parallel-rustc.html

On my machine I've noticed zero speedup. It might speed up someone's compilation time, but I guess it would require lots of cpu cores. (more than 20 I have?).

Use just bench-compilation to bench. If anyone sees a speedup and it seems beneficial, we could land. Otherwise I'm a bit dissapointed.

maan2003 commented 2 months ago

I've noticed zero speedup

because we have so many crates, all are already compiling in parallel.

if you compile one project then it will speed up.

like cargo b -p fedimint-cli

dpc commented 2 months ago

because we have so many crates, all are already compiling in parallel.

That's a rare use-case, a chunk of that time is already fast linking with mold, and then it will be like 2 seconds anyway.

@elsirion I remember you showing a pretty graph showing cpus were under-utilized and mentioning lots of cores. Do you happen to have a machine like that? BTW. What was that tool that generated that pretty graph?

maan2003 commented 2 months ago

What was that tool that generated that pretty graph?

cargo build --timings maybe?

dpc commented 2 months ago

What was that tool that generated that pretty graph?

cargo --timings maybe?

See my edit above. I don't think that's it.

maan2003 commented 2 months ago

that's nix run nixpkgs#btop

maan2003 commented 2 months ago

also we need to bench cargo check instead of cargo build

because that is only runs frontend (which was parallelized in rustc multithread)

dpc commented 2 months ago

No extra threads:

Full check debug:       34.08   458.20  41.41
Incr check debug:       2.84    3.57    2.17
Full build debug:       62.37   1248.81 90.31
Incr build debug:       9.20    16.95   8.51
Full check release:     31.61   258.52  35.97
Incr check release:     9.30    14.43   2.28
Full build release:     84.01   1726.18 80.97
Incr build release:     43.24   990.00  27.75

THREADS=8

Full check debug:       31.59   524.34  56.68
Incr check debug:       3.44    6.48    5.62
Full build debug:       60.95   1378.73 116.51
Incr build debug:       9.54    22.44   12.21
Full check release:     31.79   325.78  52.02
Incr check release:     5.94    23.59   3.47
Full build release:     82.33   1820.43 105.97
Incr build release:     42.83   1018.34 32.54

THREADS=20

Full check debug:       32.78   525.24  56.50
Incr check debug:       3.49    6.55    5.63
Full build debug:       62.24   1365.69 114.37
Incr build debug:       9.00    22.19   11.84
Full check release:     31.29   319.97  52.68
Incr check release:     5.90    23.14   3.44
Full build release:     82.45   1836.50 110.86
Incr build release:     42.22   1031.65 32.38

Not going to fix the padding, too much work. :D

Notably when threads are enabled the sys increases (I guess caused by whatever the parallelism overhead), which makes what's already small and fast suffer a bit, canceling improvements.

I guess on system that have more, but slower cores the gain might be better(?).

douglaz commented 2 months ago

Master:

CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz 

Date: 2024-03-11
Commit: 24e8e3a78f
                       total    user     sys
Full  check   debug:   51.72  599.06   48.98
Incr  check   debug:    3.75    5.32    2.81
Full  build   debug:   94.17 1582.06  106.13
Incr  build   debug:   13.05   24.85   10.57
Full  check release:   37.83  324.82   44.15
Incr  check release:   12.90   20.33    3.12
Full  build release:  118.29 2431.49   95.96
Incr  build release:   74.97 1463.55   35.07

This branch:

 CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz 

Date: 2024-03-11
Commit: d66c68f708
                    total   user    sys
Full check debug:   46.99   708.62  72.70
Incr check debug:   4.42    8.32    6.47
Full build debug:   101.06  1876.18 152.78
Incr build debug:   12.87   29.63   14.44
Full check release:     46.30   438.94  68.37
Incr check release:     7.82    30.79   4.42
Full build release:     133.48  2511.73 138.47
Incr build release:     70.44   1380.19 45.76

So this branch is slower in most stuff.

elsirion commented 2 months ago

I see a slight improvement in release mode, but might be noise, debug is slightly slower:

Full check debug:       49.10   968.92  71.50
Incr check debug:       6.59    15.41   8.99
Full build debug:       84.02   2480.56 149.89
Incr build debug:       12.49   40.47   16.98
Full check release:     53.20   582.53  65.35
Incr check release:     10.51   44.11   3.95
Full build release:     114.46  3019.66 142.44
Incr build release:     53.65   1587.94 36.96

dpc commented 2 months ago

I was going to close it, but maybe I could try https://www.reddit.com/r/rust/comments/1bgyo8a/try_cranelift_codegen_backend_for_faster_compile/ while at it.

dpc commented 2 months ago

Nice!

Before cranelift:

                       total    user     sys
Full  check   debug:   27.67  488.49   50.46
Incr  check   debug:    3.53    7.27    6.07
Full  build   debug:   56.31 1245.94  101.50
Incr  build   debug:   10.13   25.44   14.60

After cranelift:

                       total    user     sys
Full  check   debug:   20.87  275.16   44.34
Incr  check   debug:    3.77    7.64    5.98
Full  build   debug:   29.92  395.84   65.06
Incr  build   debug:    7.75   23.76   11.82

elsirion commented 2 months ago

Before:

Commit: be2815f684
                       total    user     sys
Full  check   debug:   50.83  782.93   51.03
Incr  check   debug:    4.77    7.67    2.82
Full  build   debug:   76.18 1951.05  107.73
Incr  build   debug:   11.91   29.63   10.66

After:

Commit: 0011b5374d                                                                                                                                                                                                                                                                        
                       total    user     sys                                                                                                                                                                                                                                              
Full  check   debug:   28.99  510.35   59.83                                                                                                                                                                                                                                              
Incr  check   debug:    7.23   17.43    9.65                                                                                                 
Full  build   debug:   39.40  719.91   83.11                                                                                                                                                                                                                                              
Incr  build   debug:   11.61   43.16   15.26

I like! :heart_eyes:

dpc commented 2 months ago

It's green. It's only enabled in dev builds, so IMO low risk. Let's give it a spin!

dpc commented 2 months ago

Bummer.

> cat target-nix/devimint/logs/fedimintd-0.log
2024-03-18T22:05:11.213297Z  INFO consensus: Starting config gen
2024-03-18T22:05:11.213685Z  INFO net::peer::dkg: Created new config gen Api
2024-03-18T22:05:11.215604Z  INFO net::api: Starting api on ws://127.0.0.1:12969
2024-03-18T22:05:11.215874Z  INFO fedimintd::fedimintd: Metrics API listening on 127.0.0.1:12976
trap at Instance { def: Item(DefId(1:15511 ~ core[0bd7]::core_arch::x86::sha::_mm_sha1rnds4_epu32)), args: [0_i32] } (_ZN4core9core_arch3x86
3sha19_mm_sha1rnds4_epu3217h09747cb42999c042E): llvm.x86.sha1rnds4

dpc commented 2 months ago

OK, so this can be fixed by updating nightly. :four_leaf_clover:

But now the problem is that dev builds and tests are much slower. 5x or so. I don't think anyone wants to wait 1-2 minutes to open just mprocs :D

elsirion commented 2 months ago

But now the problem is that dev builds and tests are much slower. 5x or so.

How so?

dpc commented 1 month ago

How so?

That's the whole thing with cranelift - it is supposed to be faster at the expense of optimization power. It only have basic optimizations currently, maybe it will get somewhat better in the future.

In our case it looks like consensus is getting extremely slow. Possibly just some crypto suffers from lack of certain optimization? Hard to tell. I tried increasing optimization level on certain things, but without success.

bjorn3 commented 1 month ago

But now the problem is that dev builds and tests are much slower. 5x or so.

For cryptography related crates this project is using opt-level = 3: https://github.com/fedimint/fedimint/blob/29cf90e30a5c1d1b87625ef6bc435d8533828f17/Cargo.toml#L107-L125 I never quite benchmarked it, but I did expect cg_clif with optimizations enabled to be around the speed of LLVM opt-level = 1. It is in principle possible to compile these crates with LLVM by adding codegen-backend = "llvm" to each of these package profile overrides (so for example secp256k1 = { opt-level = 3, codegen-backend = "llvm" }.) Be aware however that there is currently an ABI incompatibility around 128bit integers: https://github.com/rust-lang/rustc_codegen_cranelift/issues/1449 I haven't gotten around fixing it properly yet.

ps: This is the project you mentioned in https://lobste.rs/s/tjh7oy/cranelift_code_generation_comes_rust#c_gq7fp3, right?

Edit: Opened https://github.com/rust-lang/rustc_codegen_cranelift/issues/1468 to track emitting a warning for this case to reduce confusion in the future.

dpc commented 1 month ago

@bjorn3 Oh, it's already possible to pick these per-crate? Awesome. I'll try it out and report back. Thanks for letting us know, and I'll sub these issues and keep an eye on it.

dpc commented 1 month ago

Did another round of testing, and pushed a best approach

Current master branch:

                       total    user     sys
Full  check   debug:   34.16  420.65   39.60
Incr  check   debug:    2.86    3.98    2.55
Full  build   debug:   61.04 1138.98   84.82
Incr  build   debug:    9.71   18.53    9.47

after with cranelift + parallel (but now with llvm for perf-heavy crates):

                       total    user     sys
Full  check   debug:   29.46  342.61  107.06
Incr  check   debug:    4.96   11.67   11.06
Full  build   debug:   40.04  487.93  126.50
Incr  build   debug:    8.91   29.45   17.05

just cranelift:

                       total    user     sys
Full  check   debug:   27.74  216.44   36.03
Incr  check   debug:    3.37    4.87    2.77
Full  build   debug:   38.91  336.45   54.94
Incr  build   debug:    7.87   17.36    7.19

no -O3 on local crypto crates (fedimint-threshold-crypto) + cranelift:

                       total    user     sys
Full  check   debug:   28.19  215.80   35.18
Incr  check   debug:    3.34    4.92    2.70
Full  build   debug:   39.73  335.34   54.93
Incr  build   debug:    8.69   17.95    7.95

O3 -> O2 + above:

                       total    user     sys
Full  check   debug:   28.17  212.10   34.99
Incr  check   debug:    3.32    4.85    2.73
Full  build   debug:   38.70  332.25   55.34
Incr  build   debug:    8.25   17.47    7.72

With this setup the env RUST_LOG=devimint=debug just devimint-env starts as fast as before, so seems like a pure win now.

justinmoon commented 1 month ago

When I checkout the branch and direnv runs, the nix environment doesn't build.

$ git fetch dpc
$ git checkout go-nightly
branch 'go-nightly' set up to track 'dpc/go-nightly' by rebasing.
Switched to a new branch 'go-nightly'
[1 copied (0.3 MiB), 0.1 MiB DL] querying source on https://cache.nixos.orgdirenv: ([/nix/store/wrnrvg7ix2bmgnm3v4qvgvipgivak2ym-direnv-2.32.3/bin/direnv export fish]) is taking a while to execute. Use CTRL-C to give up.
error:
       … while calling the 'derivationStrict' builtin

         at /builtin/derivation.nix:9:12: (source not available)

       … while evaluating derivation 'nix-shell'
         whose name attribute is located at /nix/store/3i3rncs75fid9hwai5p2nvwc4ngdnia7-source/pkgs/stdenv/generic/make-derivation.nix:348:7

       … while evaluating attribute 'nativeBuildInputs' of derivation 'nix-shell'

         at /nix/store/3i3rncs75fid9hwai5p2nvwc4ngdnia7-source/pkgs/stdenv/generic/make-derivation.nix:392:7:

          391|       depsBuildBuild              = elemAt (elemAt dependencies 0) 0;
          392|       nativeBuildInputs           = elemAt (elemAt dependencies 0) 1;
             |       ^
          393|       depsBuildTarget             = elemAt (elemAt dependencies 0) 2;

       (stack trace truncated; use '--show-trace' to show the full trace)

       error: attribute 'rustc-codegen-cranelift-preview' missing

       at /nix/store/6nf10v3z293qh6qy9rlm7rcvzrz4i6m1-source/lib/mkFenixToolchain.nix:52:21:

           51|       (map
           52|         (component: fenix.packages.${system}.${channel}.${component})
             |                     ^
           53|         components

dpc commented 1 month ago

OH. Of course. Great. I could make this component conditional, but then we would need to make the backend setting conditional on the current system, in Cargo.toml. And that is probably possibly only via build.rs setting up some feature flag, at which point it doesn't seem all that worthwhile.

elsirion commented 1 month ago

Close the PR then for now?

maan2003 commented 1 month ago

can we have rustc wrapper that ignores the codegen-backend arguments on macos

maan2003 commented 1 month ago

61s => 40s on a fast system will be insane speed for slow systems.

dpc commented 1 month ago

can we have rustc wrapper that ignores the codegen-backend arguments on macos

There's only so much complexity we want to deal with.

In a couple of months MacOS support might be already implemented, and we can try-again.

fedimint / fedimint

chore: enable cranelift (switch to nightly) #4498