NixOS / nixpkgs

yarn install takes indefinitely #353709

Open gador opened 1 day ago

gador commented 1 day ago

Describe the bug

Currently yarn install hangs at the step linking dependencies...

Steps To Reproduce

Steps to reproduce the behavior:

  1. Try to build pgadmin4 on master.
  2. Wait for linking dependencies...

or just run nix build github:nixos/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2#pgadmin --rebuild

Expected behavior

yarn install should continue with the install process

Additional context

I noticed this issue while working on an unrelated small bugfix in pgadmin4 (#353092), which caused a rebuild that did not succeed. Ofborg worked just fine, which is why I merged this small fix, but the package never built on my system. It currently does not build on Hydra either (see e.g. https://hydra.nixos.org/build/277185860/nixlog/1).

I'm not sure what changed, since nothing substantial changed in the package. I've also tried re-running the update script, which resulted in exactly the same yarn.lock.

Running strace or lsof did not result in any trace of the issue.

Also, interestingly, running --check on an older nixos-unstable pgadmin4 derivation fails to build at the same step.

Is there anything in the Nix builder that changed sandbox or build behavior and could stall yarn? I looked at https://github.com/NixOS/nix/pull/10312, which changed sandbox-related code, and found that 24.05 ships an old, unpatched Nix version (2.18.2, which according to https://github.com/NixOS/nix/security/advisories/GHSA-q82p-44mg-mgh5 has not been fixed yet). That version compiles the current pgadmin4 just fine!

It does not work with a patched Nix version (it doesn't matter whether it's 2.18.4 or newer).

So the patch to fix the build-dir seems to have broken at least pgadmin.

Notify maintainers

@roberth

FliegendeWurst commented 1 day ago

I've had this issue too. It doesn't just hang, it goes into disk sleep, meaning you can't kill it, not even by shutting down the system.

gador commented 1 day ago

Yes! Not even sudo kill -9 $PID helps. Only restarting the whole system works. I'm trying to dissect where it actually goes wrong, but I believe it has something to do with the new chroot safety feature from https://github.com/NixOS/nix/commit/0e4baff868047f065749c9ba73556bf8d90fabf7
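
For anyone who wants to confirm the "disk sleep" state on their own machine: on Linux the one-letter process state is the field right after the parenthesised command name in /proc/<pid>/stat, and D stands for uninterruptible sleep, which is why SIGKILL has no visible effect while the process stays in it. Below is a minimal Python sketch, assuming a Linux /proc filesystem. The function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path


def parse_state(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')' in the line.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """List PIDs currently in state D (uninterruptible sleep)."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            state = parse_state((entry / "stat").read_text())
        except (OSError, IndexError):
            continue  # process exited in the meantime, or unreadable
        if state == "D":
            pids.append(int(entry.name))
    return sorted(pids)
```

Calling uninterruptible_pids() while a build is stuck should list the hanging node processes if they really are in state D. Processes doing ordinary disk I/O can show up briefly as well, so only entries that persist across repeated calls are interesting.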

gador commented 1 day ago

I confirmed my suspicion. I applied the following diff to the current 2.24.10 version:

diff --git a/src/libstore/unix/build/local-derivation-goal.cc b/src/libstore/unix/build/local-derivation-goal.cc
index 2a09e3dd4..baeae54f8 100644
--- a/src/libstore/unix/build/local-derivation-goal.cc
+++ b/src/libstore/unix/build/local-derivation-goal.cc
@@ -509,11 +509,11 @@ void LocalDerivationGoal::startBuilder()
     /* Create a temporary directory where the build will take
        place. */
     topTmpDir = createTempDir(settings.buildDir.get().value_or(""), "nix-build-" + std::string(drvPath.name()), false, false, 0700);
-#if __APPLE__
+//#if __APPLE__
     if (false) {
-#else
-    if (useChroot) {
-#endif
+//#else
+//    if (useChroot) {
+//#endif
         /* If sandboxing is enabled, put the actual TMPDIR underneath
            an inaccessible root-owned directory, to prevent outside
            access.

This basically reverts https://github.com/NixOS/nix/commit/0e4baff868047f065749c9ba73556bf8d90fabf7. I used the patched build as nix.package in a VM to test the build. I then ran nix build github:nixos/nixpkgs/nixos-unstable#pgadmin4 --rebuild -L and it worked!

Doing this on any newer Nix version without the above diff fails, so this is exactly the cause. yarn (for whatever reason) either does not like the subdirectory /build (which is unlikely) or the 700 permission.

Not sure how to tackle this problem, though. It is unlikely that pgadmin is the only victim here. And having to restart the whole system to kill a bunch of node yarn install ... processes isn't cool either.

@thufschmitt any idea here? Also, in light of ZHF #352882, this is a bit of a pressing problem.

roberth commented 1 day ago

I haven't seen this before. I'm not much of a darwin expert, but here's my thoughts.

The directory names got longer, and unix sockets have a very restricted length on darwin. Some software does not expect a long(er) TMPDIR and may not handle that correctly, leading to undefined/strange behavior.

Although strace didn't reveal much, it might be worth comparing a hanging run to a successful run, especially if the execution is deterministic, which makes a semi-automated comparison much easier.

Is each node in this chain of directories that makes up TMPDIR readable (+rx) by the sandboxed build process? If not, would it be ok to make it readable only by the build user? This is slightly less secure, but might be ok.

This could probably be fixed on either side, Nix or yarn. Could you open an issue on the https://github.com/NixOS/nix repo for the regression? It'd help to get more eyes on this. (I'd move the issue if it was clearly one or the other, fwiw)

Another practical note: @thufschmitt has changed jobs and isn't contributing actively to the Nix/NixOS ecosystem anymore.

FliegendeWurst commented 1 day ago

> I haven't seen this before. I'm not much of a darwin expert, but here's my thoughts.

I have the same issue on Linux. There is nothing really suspicious in lsof either.

pnpm    191060 nixbld1 cwd       DIR   0,36       40    407492 /build/source (deleted)
pnpm    191060 nixbld1 rtd       DIR  259,2     4096  41432658 /
... lots of /nix/store paths, anon_inode io_uring, pipes ...

Shawn8901 commented 1 day ago

Since I don't see it mentioned explicitly: I think this is definitely not limited to yarn (pnpm is shown in the previous comment). I observed a similar issue when building stalwart-mail.webadmin while trying to reproduce a recent build failure. That package uses npm (same symptoms: never finished, 0 activity, can't kill -9, shutdown blocked).

I was running a nixos-unstable that was maybe 1-2 weeks old. Let me know if I should try to reproduce and gather more information.

gador commented 1 day ago

@roberth thanks for chiming in. This is not a darwin issue: the affected code is only executed on non-Apple systems. I can build pgadmin just fine on 2.24.9 on aarch64-darwin. My "patch" above just disables the chroot condition for all systems.

Even worse, when trying to build pgadmin on Linux, the unkillable processes prevent the system from rebooting or shutting down! It hangs forever on a watchdog issue and needs to be powered down by hand. This can be a huge issue for bare-metal servers.

gador commented 1 day ago

> Is each node in this chain of directories that makes up TMPDIR readable (+rx) by the sandboxed build process?

AFAIS, yes.

ls -la /tmp
drwx------  3 root  root     3 Nov  5 06:21 nix-build-pgadmin-8.11.drv-1
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1
drwx------  5 nixbld1 nixbld  7 Nov  5 06:21 build
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1/build
total 64
drwx------ 5 nixbld1 nixbld     7 Nov  5 06:21 .
drwx------ 3 root    root       3 Nov  5 06:21 ..
drwxr-xr-x 3 nixbld1 nixbld     3 Nov  5 06:21 .cache
-rw------- 1 nixbld1 nixbld 35469 Nov  5 06:21 env-vars
drwxr-xr-x 9 nixbld1 nixbld    19 Nov  5 06:21 source
drwxr-xr-x 3 nixbld1 nixbld     3 Nov  5 06:21 v8-compile-cache-1000
-rw-r--r-- 1 nixbld1 nixbld   160 Nov  5 06:21 .yarnrc
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1/build/.cache/yarn/v6
[...]
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yarn-audit-html-4.0.0-dc04c9cf83e758fd6d9efad8c96df1fc8c4bf30c
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yauzl-2.10.0-c7eb17c93e112cb1086fa6d8e51fb0667b79a5f9
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yocto-queue-0.1.0-0294eb3dee05028d31ee1a5fa2c556a6aaf10a1b
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yocto-queue-1.1.1-fef65ce3ac9f8a32ceac5a634f74e17e5b232110
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-zustand-4.5.4-63abdd81edfb190bc61e0bbae045cc4d52158a05
drwxr-xr-x    2 nixbld1 nixbld    2 Nov  5 06:21 .tmp
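
The manual ls checks above can be sketched as a small script that prints owner and mode for every directory from / down to a given path. This is only a rough sketch: describe_chain is a hypothetical helper, not part of Nix, and it inspects the host-side path, whereas the sandboxed build sees its directory as /build, so the output still needs interpreting. It has to run as root to see through the 0700 directories.

```python
import os
import stat
from pathlib import Path


def describe_chain(path):
    """Return (directory, owner uid, mode string) for each component, root first."""
    target = Path(os.path.abspath(path))
    # parents is ordered nearest-first, so reverse it to start at the root.
    chain = list(target.parents)[::-1] + [target]
    rows = []
    for directory in chain:
        try:
            info = os.stat(directory)
            rows.append((str(directory), info.st_uid, stat.filemode(info.st_mode)))
        except OSError as err:
            # Typically "Permission denied" when a parent is not traversable.
            rows.append((str(directory), None, err.strerror))
    return rows
```

For example, printing each row of describe_chain("/tmp/nix-build-pgadmin-8.11.drv-1/build/source") would show in one pass which component of the chain is root-owned with mode drwx------.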

> This could probably be fixed on either side, Nix or yarn. Could you open an issue on the https://github.com/NixOS/nix repo for the regression?

done

blurgyy commented 16 hours ago

Also seeing this on an x86-64 linux machine running hydra: the command npm ci runs forever and kill -9 does nothing.

datafoo commented 14 hours ago

Same here on my x86-64 linux development VM. I did a nixos-rebuild switch --upgrade yesterday and since then the problem happens with npm ci and npm install.

gador commented 14 hours ago

@datafoo when was your last known good commit?

gador commented 8 hours ago

I investigated further and I narrowed it down to something between these commits:

broken  4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0  2024-10-18  1809433
good    a3c0b3b21515f74fd2665903d4ce6bc4dc81c77c  2024-10-14  1809364

I tested each commit as the input for a NixOS VM with a fixed nix.package = pkgs.nixVersions.nix_2_24; and always tried to build nix build -L --rebuild github:nixos/nixpkgs/nixos-unstable#pgadmin4. With the broken commit, this stalls. With the good commit, it continues and builds. Since the derivation to build is fixed (and so are all its inputs, e.g. yarn or node), this obviously has something to do with the build environment, which changed between those commits.

I haven't found an obvious culprit with git diff yet.
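
With one known-good and one known-bad commit, narrowing further is a plain bisection, the same idea git bisect automates. Here is a sketch of just the search logic in Python, assuming the caller supplies a hypothetical is_bad callback that boots a VM on the given commit, attempts the pgadmin4 build, and reports whether it stalled. Since a stalled build cannot be killed, such a callback would realistically need a fresh VM per step.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit.

    commits is ordered oldest to newest; commits[0] is assumed good and
    commits[-1] is assumed bad. is_bad(commit) returns True for a bad commit.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

The number of builds needed grows with the logarithm of the number of commits in the range, so even a few thousand commits would take roughly a dozen VM runs.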

Garmelon commented 6 hours ago

On my system, manually (as in: typing it into my terminal) running npm ci in a repo also hangs the npm ci process. The build is not running through nix. The process is un-sigkill-able.

My system is running on nixpkgs commit 807e9154dcb16384b1b765ebe9cd2bba2ac287fd.

Edit: Steps to reproduce (at least on my machine):

  1. cd into a project that uses npm. (I don't yet know if this works on all repos or only more complicated ones.)
  2. Run rm -r node_modules
  3. Run npm ci. Note that this time, it completes and exits successfully, as expected.
  4. Run npm ci immediately afterwards (may be time sensitive). Note that it appears to hang, the little spinner spinning indefinitely, without any other output.
  5. Press Ctrl+C. Note that the npm ci process still exists, but now in its un-SIGKILL-able state. Since the process still exists, you are not dumped back in your shell prompt either.

I kept running npm ci in different ways (but in the same repository). Roughly every second npm ci call seemed to get stuck. These patterns seemed to hold most of the time:

  1. After a successful run of npm ci, an immediate rerun seems to get stuck.
  2. After a stuck run of npm ci, an immediate rerun seems to succeed.
  3. After a successful run of npm ci, a rerun after a wait of a minute or so seems to succeed or get stuck randomly.
  4. After a stuck run of npm ci, a rerun after a wait of a minute or so seems to succeed.
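
To collect the success/stuck pattern above without typing the command by hand, something like the following Python sketch could be used. The names run_once and run_series and the timeout value are assumptions for illustration. The important design choice is that the script polls the child and never waits on it, so an un-SIGKILL-able child cannot take the script down with it.

```python
import subprocess
import sys
import time


def run_once(cmd, cwd=None, timeout=120.0):
    """Run cmd and return "ok", "failed", or "stuck" without blocking on the child."""
    proc = subprocess.Popen(
        cmd, cwd=cwd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = proc.poll()  # non-blocking check for exit
        if code is not None:
            return "ok" if code == 0 else "failed"
        time.sleep(0.05)
    # Best effort only: a process in uninterruptible sleep ignores this,
    # so deliberately do not call proc.wait() afterwards.
    proc.kill()
    return "stuck"


def run_series(cmd, runs=6, pause=0.0, **kwargs):
    """Run cmd several times, optionally pausing between runs."""
    results = []
    for _ in range(runs):
        results.append(run_once(cmd, **kwargs))
        time.sleep(pause)
    return results
```

Calling run_series(["npm", "ci"], runs=6, cwd="path/to/project") would test the immediate-rerun pattern, and pause=60 the delayed one. Stuck children are left behind, so this is best tried in a throwaway VM.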

donovanglover commented 1 hour ago

Possibly related:

Does downgrading to npm 10.3.0 work for you?

donovanglover commented 1 hour ago

Bun and Deno seem to not be affected.

gador commented 37 minutes ago

@Garmelon I think this is an unrelated bug. What I described here is a bug in a build process run by Nix, which always uses the same node and yarn version and fails or succeeds depending on the host machine's NixOS version. This is why I suspect Nix to be involved.

@donovanglover I cannot rule out a random hang in the build process. But as of now it consistently works or consistently fails depending on the commit of the build machine.