edolstra opened 2 weeks ago
Tested this on my M1 Pro MBP running NixOS 24.05:
```
14.61user 2.75system 0:23.08elapsed 75%CPU (0avgtext+0avgdata 7967760maxresident)k
17952inputs+0outputs (63major+498626minor)pagefaults 0swaps

28.69user 2.54system 0:05.35elapsed 583%CPU (0avgtext+0avgdata 7744096maxresident)k
0inputs+0outputs (6major+515584minor)pagefaults 0swaps
```
The duration numbers are good, but the throughput on @RossComputerGuy's mac looks a little worrying; it appears to be half as efficient (in user time). This could be explained somewhat by the evaluation being memory-bound, except that the M1 is supposed to have amazing memory bandwidth.
It'd also be interesting to compare against the multi-threaded build with `NR_CORES=1`, as well as the other values for it, to see how it scales.
@roberth Yeah, I'm not sure why it was like that. I recently did an update which might've updated the kernel from 6.8.9-asahi to something newer, but I didn't reboot. I ran the first command mentioned by Eelco. I could do a fresh boot, record the times, and run the other command as well. It could also be that the Asahi kernel doesn't have the memory timing stuff quite as optimized as macOS; I could boot macOS and see what happens. But I have a 64-core Ampere machine on the way, so I could give that a try when it arrives.
Note that Determinate Systems has already blogged about this, which is fine (that's just part of content marketing, and it's more than ok to be excited), but let's recognize that more work needs to be done to:
Especially don't underestimate the first point. Nix is a critical component of users' systems, so we mitigate risks carefully. I think part of that is limiting this to comparatively non-critical use cases such as `nix search`, but that also means that we should all expect a delay in the delivery of this feature. To set expectations: if `nix build` enables this within a year, Eelco and DetSys will have done a stellar job on this.
Motivation
This PR makes the evaluator thread-safe. Currently, only `nix flake search` and `nix flake show` make use of multi-threaded evaluation to achieve a speedup on multicore systems.

Unlike the previous attempt at a multi-threaded evaluator, this one locks thunks to prevent them from being evaluated more than once. The life of a thunk is now:
- On the first `forceValue()` call, the thunk type goes from `tThunk` to `tPending`.
- If another thread calls `forceValue()` on a thunk in the `tPending` state, it acquires a lock to register itself as "awaiting" that value, and sets the type to `tAwaited`.
- When the evaluating thread finishes: if the type is `tAwaited`, it updates the value and wakes up the threads that are waiting; if the type is `tPending`, it just updates the value normally.

Also, there is now a `tFailed` value type that stores an exception pointer, to represent the case where thunk evaluation throws an exception. In that case, every thread that forces the thunk should get the same exception.

To enable multi-threaded evaluation, you need to set the `NR_CORES` environment variable to the number of threads to use. You can also set `NIX_SHOW_THREAD_STATS=1` to get some debug statistics.

Some benchmark results on a Ryzen 5900X with 12 cores and 24 hyper-threads:
- `NR_CORES=12 GC_INITIAL_HEAP_SIZE=8G nix flake show --no-eval-cache --all-systems --json github:NixOS/nix/afdd12be5e19c0001ff3297dea544301108d298` went from 23.70s to 5.77s.
- `NR_CORES=16 GC_INITIAL_HEAP_SIZE=6G time nix search --no-eval-cache github:NixOS/nixpkgs/bf8462aeba50cc753971480f613fbae0747cffc0?narHash=sha256-bPyv7hsbtuxyL6LLKtOYL6QsmPeFWP839BZQMd3RoUg%3D` went from 11.82s to 3.88s.

Note: it's good to set `GC_INITIAL_HEAP_SIZE` to a high value, because stop-the-world garbage collection is expensive.

To do:
- `nix flake check`.
- The `Executor` class currently executes work items in random order, to reduce the probability that we execute a bunch of items at the same time that all depend on the same thunk, causing all but one to be blocked. This can probably be improved.

Context
Priorities and Process
Add :+1: to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.