edolstra commented 2 weeks ago

Motivation

This PR makes the evaluator thread-safe. Currently, only nix flake search and nix flake show make use of multi-threaded evaluation to achieve a speedup on multicore systems.

Unlike the previous attempt at a multi-threaded evaluator, this one locks thunks to prevent them from being evaluated more than once. The life of a thunk is now:

On the first forceValue() call, the thunk type goes from tThunk to tPending.
If another thread does a forceValue() on a thunk in the tPending state, it acquires a lock to register itself as "awaiting" that value, and sets the type to tAwaited.
Once the first thread finished the value and its type is tAwaited, it updates the value and wakes up the threads that are waiting. If the type is tPending, it just updates the value normally.

Also, there now is a tFailed value type that stores an exception pointer to represent the case where thunk evaluation throws an exception. In that case, every thread that forces the thunk should get the same exception.

To enable multi-threaded evaluation, you need to set the NR_CORES environment variable to the number of threads to use. You can also set NIX_SHOW_THREAD_STATS=1 to get some debug statistics.

Some benchmark results on a Ryzen 5900X with 12 cores and 24 hyper-threads:

NR_CORES=12 GC_INITIAL_HEAP_SIZE=8G nix flake show --no-eval-cache --all-systems --json github:NixOS/nix/afdd12be5e19c0001ff3297dea544301108d298 went from 23.70s to 5.77s.
NR_CORES=16 GC_INITIAL_HEAP_SIZE=6G time nix search --no-eval-cache github:NixOS/nixpkgs/bf8462aeba50cc753971480f613fbae0747cffc0?narHash=sha256-bPyv7hsbtuxyL6LLKtOYL6QsmPeFWP839BZQMd3RoUg%3D ^ went from 11.82s to 3.88s.

Note: it's good to set GC_INITIAL_HEAP_SIZE to a high value because stop-the-world garbage collection is expensive.

To do:

Infinite recursion detection through blackholing is currently disabled.
More commands should be multi-threaded, in particular nix flake check.
We should have some auto-parallelization of single evaluations (like NixOS system configurations). One way to do this would be to evaluate all attributes of a derivation in parallel.
This PR make some high contention data structures (in particular the symbol table) more multi-thread friendly, but there is more that can be done.
The Executor class currently executes work items in random order to reduce the probability that we execute a bunch of items at the same time that all depend on the same thunk, causing all but one to be blocked. This can probably be improved.

Context

Priorities and Process

Add :+1: to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

RossComputerGuy commented 4 days ago

Tested this on my M1 Pro MBP running NixOS 24.05:

Before

14.61user 2.75system 0:23.08elapsed 75%CPU (0avgtext+0avgdata 7967760maxresident)k
17952inputs+0outputs (63major+498626minor)pagefaults 0swaps

After

28.69user 2.54system 0:05.35elapsed 583%CPU (0avgtext+0avgdata 7744096maxresident)k
0inputs+0outputs (6major+515584minor)pagefaults 0swaps

roberth commented 4 days ago

The duration numbers are good, but the throughput on @RossComputerGuy's mac looks a little worrying; it appears to be half as efficient (user<time>). This could be explained somewhat by the evaluation being memory bound, except the M1 is supposed to have amazing memory bandwidth. It'd also be interesting to compare against the multi-threaded build with NR_CORES=1, as well as the other values for it, to see how it scales.

RossComputerGuy commented 4 days ago

@roberth Yeah, I'm not sure why it was like that. I just recently did an update which might've updated the kernel from 6.8.9-asahi to something newer but I didn't reboot. I ran the first command mentioned by Eelco. I could do a fresh boot and get the times down and run the other command as well. It always could be the Asahi kernel doesn't quite have the memory timing stuff quite as optimized as macOS. Could boot macOS and see what happens. But I do have 64 cores Ampere on the way so I always could give that a try when that arrives.

roberth commented 3 days ago

Note that Determinate Systems has already blogged about this, which is fine (that's just part of content marketing, and it's more than ok to be excited), but let's recognize that more work needs to be done to:

make sure this will be stable operationally
have no regressions
characterize performance in terms of throughput instead of just latency
tune this reasonably (e.g. probably not all cores on a multi-CPU or NUMA system).

Especially don't underestimate the first point. Nix is a critical component of users' systems, so we mitigate risks carefully. I think part of that is limiting this to comparatively non-critical use cases such as nix search, but that also means that we should all expect a delay in the delivery of this feature. To set expectations, if nix build enables this within a year, Eelco and DetSys must have done a stellar job on this.

Planning, opinion, Nix team

Also note that the Nix team is unlikely to prioritize this work because other areas need attention, such as fixes and interface stability, which essentially means _finishing_ things instead of starting new projects, esp. without consulting the team. All progress is good, and if this amplifies our pace of development because it aligns with DetSys and/or other contributors, that'd be great, but if it's zero-sum, it'd be inefficient not to focus on other work that's already in flight.

NixOS / nix

Multithreaded evaluator #10938

Motivation

Context

Priorities and Process

Before

After