Nix 2.16 broke something in the way binary substituters are used

yajo commented 1 year ago

Describe the bug

Some of my CI pipelines are starting to break after starting to use Nix 2.16, without any other change. The break happens because the jobs last more than 1h, reaching their limit. Downgrading to 2.15.1 fixed the issue.

Steps To Reproduce

Try to build the devshell with Nix 2.16. It will take forever. Probably due to having to compile nodejs-18.15.0. I run it inside a container to imitate better how those CI jobs work, and to avoid local nix store to pollute fetch and build times:

➤ time podman container run --rm -it docker.io/nixos/nix:2.16.0 nix --extra-experimental-features 'flakes nix-command' --accept-flake-config build gitlab:moduon/mrchef/009d93eb7f0a9eabd3b248f5c10c7523422a7f56#devShells.x86_64-linux.default

My goal was to paste the time this takes to build, but after a couple of hours I got tired. It's too much anyway.

Expected behavior

Nix 2.15.1 was much faster. It didn't have to build anything, as all is in binary caches:

➤ time podman container run --rm -it docker.io/nixos/nix:2.15.1 nix --extra-experimental-features 'flakes nix-command' --accept-flake-config build gitlab:moduon/mrchef/009d93eb7f0a9eabd3b248f5c10c7523422a7f56#devShells.x86_64-linux.default

________________________________________________________
Executed in  523.40 secs    fish           external
   usr time    1.08 secs  238.00 micros    1.08 secs
   sys time    1.12 secs  719.00 micros    1.12 secs

Additional context

Here you have a CI job running on the docker image docker.io/nixos/nix, which resolved to sha256:01ab413500cf58ee2291607debcec6d1fc62a1ad2cb31f7022e8c1c25db45662 (2.15) some weeks ago; it ends in 3:05 minutes: https://gitlab.com/moduon/mrchef/-/jobs/4308352194

I retried the same job today, which resolves the image to sha256:18daab29b75942fe338afdb4198eb7df9944df7919a5a455c2d051f4c0394f78 (2.16); i stopped the job manually after 11:04 seconds because it was already too much: https://gitlab.com/moduon/mrchef/-/jobs/4461501535

Priorities

Add :+1: to issues you find important.

NobbZ commented 1 year ago

Given that containers usually run on some restricted subset of the hostmachine, have you played with the max-substitution-jobs setting in the container?

Perhaps something in the container can't cope the increased number of tasks/processes/jobs of the 16 substitutions?

Perhaps you had more than 16 concurrent downloads before because you have a hogher build-job count, and now are slowed down actually limitting the former 30+ jobs to just half of it?

https://nixos.org/manual/nix/stable/command-ref/conf-file.html#conf-max-substitution-jobs

yajo commented 1 year ago

I have checked with 1 and 2 max-substitution-jobs:

time podman container run --rm -it docker.io/nixos/nix:2.16.0 nix --extra-experimental-features 'flakes nix-command' --accept-flake-config --max-substitution-jobs 1 build gitlab:moduon/mrchef/009d93eb7f0a9eabd3b248f5c10c7523422a7f56#devShells.x86_64-linux.default

time podman container run --rm -it docker.io/nixos/nix:2.16.0 nix --extra-experimental-features 'flakes nix-command' --accept-flake-config --max-substitution-jobs 2 build gitlab:moduon/mrchef/009d93eb7f0a9eabd3b248f5c10c7523422a7f56#devShells.x86_64-linux.default

Both attempt to rebuild nodejs and take forever, while the 2.15.1 image still works perfectly and downloads everything from binary cache.

Could you attempt to reproduce locally? The commands I posted are quite reproducible.

NobbZ commented 1 year ago

With a single substitution job I indeed see a massive slow down. With the original 16 I do not see any slowdown compared to running nix without docker (I tested in docker not podman) and the concurrent downloads are saturating my downstream.

I can not see any noticeable slowdown in the build of nodejs itself.

I will post timings of both runs once I have them avaible, this can take quite a while, as I let the first build finish first, then do the next, and I am leaving for a doctors appointment now.

NobbZ commented 1 year ago

I can reproduce your issue, though this looks much more like a reproducibility issue rather than a substitution issue:

$ docker container run --rm -it docker.io/nixos/nix:2.15.1 nix --extra-experimental-features 'flakes nix-command' --accept-flake-config eval gitlab:moduon/mrchef/009d93eb7f0a9eabd3b248f5c10c7523422a7f56#devShells.x86_64-linux.default.outPath
warning: refname 'HEAD' is ambiguous.
"/nix/store/p060fx46vy1sywd1i3hbhxhcr715480q-devshell"
$ docker container run --rm -it docker.io/nixos/nix:2.16.0 nix --extra-experimental-features 'flakes nix-command' --accept-flake-config eval gitlab:moduon/mrchef/009d93eb7f0a9eabd3b248f5c10c7523422a7f56#devShells.x86_64-linux.default.outPath
warning: refname 'HEAD' is ambiguous.
"/nix/store/ywpr4b5d28gsvngq2q6bj755ambcasa9-devshell"

I did a slight cleanup of the output, as my terminal messed it up during resize. I hope I haven't removed relavant things…

Anyway, there is a warning that concerns me and which also might possibly be related:

warning: refname 'HEAD' is ambiguous.

I currently do not have the time to dig through your code to figure out where to problem arises.

Therefore I also can not say, nor quickly check if this reproducibility issue actually is your fault or nix'.

NobbZ commented 1 year ago

I identified the commit that causes this (e4aebccf209c12714f5547ec5acf08b1fd7f51cc):

commit e4aebccf209c12714f5547ec5acf08b1fd7f51cc
Author: Eelco Dolstra <edolstra@gmail.com>
Date:   Sat May 27 17:53:30 2023 +0200

    Restore Nix 2.3 behaviour for {__impure,__contentAddressed} = false

    Fixes #8405.

 src/libexpr/primops.cc | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

git bisect log

``` git bisect start # status: waiting for both good and bad commits # bad: [1ac5f9eac2703c2213cfc522c92bd5c141818fac] Mark official release git bisect bad 1ac5f9eac2703c2213cfc522c92bd5c141818fac # status: waiting for good commit(s), bad commit known # good: [9204ab4d5873c56ab47f96329e31f5430fc15069] Merge pull request #8350 from NixOS/backport-8154-to-2.15-maintenance git bisect good 9204ab4d5873c56ab47f96329e31f5430fc15069 # good: [0a546242755890e41099cf4aa49762c916876748] Merge pull request #8200 from edolstra/release-notes git bisect good 0a546242755890e41099cf4aa49762c916876748 # good: [b5d9ef0a4ce621a0f022bba7105d21d269e6455c] Merge pull request #3921 from obsidiansystems/trustless-remote-builder-simple git bisect good b5d9ef0a4ce621a0f022bba7105d21d269e6455c # good: [e97e9e9f0055d02c16a3dc8e7891c06d643aaf82] test the garbage collection with the old profile dir git bisect good e97e9e9f0055d02c16a3dc8e7891c06d643aaf82 # good: [b1c34c0ee808f60da32f09e238087106b9c4f680] Merge pull request #8366 from obsidiansystems/worker-proto-forward-decl-types git bisect good b1c34c0ee808f60da32f09e238087106b9c4f680 # good: [940e9eb8dd6deacb2e41aca91d65bd0e616011d6] Merge pull request #8240 from tweag/macos-sandbox git bisect good 940e9eb8dd6deacb2e41aca91d65bd0e616011d6 # bad: [52004696c080db6640af9a4e0d1e6f56d094cd2b] Merge pull request #8413 from doronbehar/doc/distributed-builds/clearer-warning git bisect bad 52004696c080db6640af9a4e0d1e6f56d094cd2b # bad: [ecae62020b64914d9859a71ce197d03688c6133c] Merge pull request #8406 from NixOS/fix-ca-attrs-false git bisect bad ecae62020b64914d9859a71ce197d03688c6133c # good: [6b56376c89a84388da4f25792c4828baee3cc5e4] Bump cachix/install-nix-action from 20 to 21 git bisect good 6b56376c89a84388da4f25792c4828baee3cc5e4 # bad: [e4aebccf209c12714f5547ec5acf08b1fd7f51cc] Restore Nix 2.3 behaviour for {__impure,__contentAddressed} = false git bisect bad e4aebccf209c12714f5547ec5acf08b1fd7f51cc # first bad commit: [e4aebccf209c12714f5547ec5acf08b1fd7f51cc] Restore Nix 2.3 behaviour for {__impure,__contentAddressed} = false ```

As the commit says, it was changing how __impure and __contentAddressed arguments in a drv work, it is very likely that you have one of those in your dependency graph with an explicit = false.

Given #8405 the current behaviour is correct and 2.15 (and before) behaved wrong.

From my understanding it is impossible to fix both #8405 and this at the same time. One of these issues will have to be closed as a "wont-fix", sorry.

Not sure whom to ping here to receive confirmation.

puckipedia commented 1 year ago

Repeating what NobbZ said; unless you're setting __contentAddressed = false; manually somewhere, the issue should really only show up if you use one of the nixpkgs commits after nodejs was explicitly set to not being content-addressed, and before this bug was worked around in nixpkgs.

Though, manually looking at all three of your nixpkgs revisions (why's there three??) i don't instantly see a revision in this situation. If you can run nix --extra-experimental-features nix-command show-derivation -r on both versions and find out if it really is a __contentAddressed (or __impure?) that'd probably let you figure it out.

NixOS / nix

Nix 2.16 broke something in the way binary substituters are used #8507