Open gador opened 1 day ago
I've had this issue too. It doesn't just hang, it goes into disk sleep, meaning you can't kill it, not even by shutting down the system.
Yes! Not even sudo kill -9 $PID helps. Only restarting the whole system works. I'm trying to dissect where it actually goes wrong, but I believe it has something to do with the new chroot safety feature from https://github.com/NixOS/nix/commit/0e4baff868047f065749c9ba73556bf8d90fabf7
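For anyone who wants to confirm the "disk sleep" state without guessing from kill behaviour, here is a minimal sketch (Linux only) that lists processes whose state field in /proc/<pid>/stat is D, i.e. uninterruptible sleep. The function names are made up for illustration and are not part of any Nix tooling.

```python
import os


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) parsed from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state 'D'."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style procfs; nothing to report
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A process that shows up here on every run, and survives kill -9, matches the symptom described above.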
I confirmed my suspicion.
I have the following diff on the current 2.24.10 version:
diff --git a/src/libstore/unix/build/local-derivation-goal.cc b/src/libstore/unix/build/local-derivation-goal.cc
index 2a09e3dd4..baeae54f8 100644
--- a/src/libstore/unix/build/local-derivation-goal.cc
+++ b/src/libstore/unix/build/local-derivation-goal.cc
@@ -509,11 +509,11 @@ void LocalDerivationGoal::startBuilder()
/* Create a temporary directory where the build will take
place. */
topTmpDir = createTempDir(settings.buildDir.get().value_or(""), "nix-build-" + std::string(drvPath.name()), false, false, 0700);
-#if __APPLE__
+//#if __APPLE__
if (false) {
-#else
- if (useChroot) {
-#endif
+//#else
+// if (useChroot) {
+//#endif
/* If sandboxing is enabled, put the actual TMPDIR underneath
an inaccessible root-owned directory, to prevent outside
access.
This basically reverts https://github.com/NixOS/nix/commit/0e4baff868047f065749c9ba73556bf8d90fabf7. I used the patched build as nix.package in a VM to test the build. I then ran nix build github:nixos/nixpkgs/nixos-unstable#pgadmin4 --rebuild -L and it did work!
Doing this on any newer nix version without the above diff fails. So this is exactly the reason. yarn (for whatever reason) either does not like the subdirectory /build (which is unlikely) or the permission 700.
Not sure how to tackle this problem, though. It is unlikely that pgadmin is the only victim here. And that you have to restart the whole system to kill a bunch of node yarn install ... processes isn't cool either.
@thufschmitt any idea here? Also, in light of ZHF #352882, this is a bit of a pressing problem.
I haven't seen this before. I'm not much of a darwin expert, but here's my thoughts.
The directory names got longer, and unix sockets have a very restricted length on darwin. Some software does not expect a long(er) TMPDIR and may not handle that correctly, leading to undefined/strange behavior.
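To make the socket-length concern concrete: the sun_path field of sockaddr_un is 108 bytes on Linux and 104 bytes on macOS, so a deeper TMPDIR leaves less room for the socket file name. The sketch below only illustrates that hypothesis; nothing in this thread confirms that a socket path is actually involved, and the helper and the socket name in the usage note are hypothetical.

```python
import os
import sys

# Size of sockaddr_un.sun_path: 108 bytes on Linux, 104 on macOS.
SUN_PATH_MAX = 104 if sys.platform == "darwin" else 108


def socket_path_fits(tmpdir: str, socket_name: str, limit: int = SUN_PATH_MAX) -> bool:
    """Check whether tmpdir/socket_name fits into sun_path.

    Conservatively reserves one byte for the terminating NUL, which is
    what portable software assumes.
    """
    path = os.path.join(tmpdir, socket_name)
    return len(os.fsencode(path)) < limit
```

For example, socket_path_fits(os.environ.get("TMPDIR", "/tmp"), "example.sock") run inside a build would show whether a given socket name is still within the limit.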
Although strace didn't reveal much, it might be worth comparing a hanging run to a successful run, especially if the execution is deterministic, which makes a semi-automated comparison much easier.
Is each node in this chain of directories that makes up TMPDIR readable (+rx) by the sandboxed build process? If not, would it be ok to make it readable only by the build user? This is slightly less secure, but might be ok.
This could probably be fixed on either side, Nix or yarn. Could you open an issue on the https://github.com/NixOS/nix repo for the regression? It'd help to get more eyes on this. (I'd move the issue if it was clearly one or the other, fwiw)
Another practical note: @thufschmitt has changed jobs and isn't contributing actively to the Nix/NixOS ecosystem anymore.
> I haven't seen this before. I'm not much of a darwin expert, but here's my thoughts.
I have the same issue on Linux. There is nothing really suspicious in lsof either.
pnpm 191060 nixbld1 cwd DIR 0,36 40 407492 /build/source (deleted)
pnpm 191060 nixbld1 rtd DIR 259,2 4096 41432658 /
... lots of /nix/store paths, anon_inode io_uring, pipes ...
As I don't see it explicitly named: I think it is definitely not yarn only (pnpm is shown in the previous comment). I observed a similar issue when building stalwart-mail.webadmin while trying to reproduce a recent build failure. That package uses npm (same symptoms: never finished, 0 activity, can't kill -9, shutdown blocked).
I was running a maybe 1-2 weeks old nixos-unstable. Let me know if I should try to reproduce and gather some information.
@roberth thanks for chiming in. This is not a Darwin issue: it is only present when the code is executed on a non-APPLE system.
I can build pgadmin just fine on 2.24.9 on aarch64-darwin. My "patch" above just disables the chroot condition for all systems.
Also, even worse, when trying to build pgadmin on Linux: because the processes are unkillable, the system will neither reboot nor shut down! It will hang forever on a watchdog issue and the system needs to be powered down by hand. This can be a huge issue for bare-metal servers.
> Is each node in this chain of directories that makes up TMPDIR readable (+rx) by the sandboxed build process?
AFAIS, yes.
ls -la /tmp
drwx------ 3 root root 3 Nov 5 06:21 nix-build-pgadmin-8.11.drv-1
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1
drwx------ 5 nixbld1 nixbld 7 Nov 5 06:21 build
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1/build
total 64
drwx------ 5 nixbld1 nixbld 7 Nov 5 06:21 .
drwx------ 3 root root 3 Nov 5 06:21 ..
drwxr-xr-x 3 nixbld1 nixbld 3 Nov 5 06:21 .cache
-rw------- 1 nixbld1 nixbld 35469 Nov 5 06:21 env-vars
drwxr-xr-x 9 nixbld1 nixbld 19 Nov 5 06:21 source
drwxr-xr-x 3 nixbld1 nixbld 3 Nov 5 06:21 v8-compile-cache-1000
-rw-r--r-- 1 nixbld1 nixbld 160 Nov 5 06:21 .yarnrc
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1/build/.cache/yarn/v6
[...]
drwxr-xr-x 3 nixbld1 nixbld 3 Nov 5 06:21 npm-yarn-audit-html-4.0.0-dc04c9cf83e758fd6d9efad8c96df1fc8c4bf30c
drwxr-xr-x 3 nixbld1 nixbld 3 Nov 5 06:21 npm-yauzl-2.10.0-c7eb17c93e112cb1086fa6d8e51fb0667b79a5f9
drwxr-xr-x 3 nixbld1 nixbld 3 Nov 5 06:21 npm-yocto-queue-0.1.0-0294eb3dee05028d31ee1a5fa2c556a6aaf10a1b
drwxr-xr-x 3 nixbld1 nixbld 3 Nov 5 06:21 npm-yocto-queue-1.1.1-fef65ce3ac9f8a32ceac5a634f74e17e5b232110
drwxr-xr-x 3 nixbld1 nixbld 3 Nov 5 06:21 npm-zustand-4.5.4-63abdd81edfb190bc61e0bbae045cc4d52158a05
drwxr-xr-x 2 nixbld1 nixbld 2 Nov 5 06:21 .tmp
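The manual ls checks above can also be scripted. This is a rough sketch that prints mode and ownership for every directory on the way to a path and does a simplified "can this uid traverse it" check. It looks at classic permission bits only (no ACLs, no user-namespace mapping, no bind mounts), so it describes the host's view and not necessarily what the sandboxed process sees. The function names are hypothetical.

```python
import os
import stat
from pathlib import Path


def describe_chain(path: str) -> list[tuple[str, str, int, int]]:
    """Return (directory, mode string, uid, gid) for path and all its ancestors, root first."""
    target = Path(path).resolve()
    chain = [target, *target.parents][::-1]
    rows = []
    for component in chain:
        st = os.stat(component)
        rows.append((str(component), stat.filemode(st.st_mode), st.st_uid, st.st_gid))
    return rows


def can_search(st: os.stat_result, uid: int, gids: set[int]) -> bool:
    """Simplified check of the execute (search) bit for a given uid and group set."""
    if uid == 0:
        return True  # root bypasses permission bits
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IXUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IXGRP)
    return bool(st.st_mode & stat.S_IXOTH)


if __name__ == "__main__":
    import sys

    # Usage: run as root with the build directory as the only argument.
    for name, mode, uid, gid in describe_chain(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(mode, uid, gid, name)
```

Run as root against the build directory from the listing, this prints one line per path component, which makes a 700 directory owned by a different user easy to spot.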
> This could probably be fixed on either side, Nix or yarn. Could you open an issue on the https://github.com/NixOS/nix repo for the regression?
done
Also seeing this on an x86-64 linux machine running hydra: the command npm ci runs forever and kill -9 does nothing.
Same here on my x86-64 linux development VM. I did a nixos-rebuild switch --upgrade yesterday and since then the problem happens with npm ci and npm install.
@datafoo when was your last known good commit?
I investigated further and I narrowed it down to something between these commits:
broken 4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0 2024-10-18 1809433
good   a3c0b3b21515f74fd2665903d4ce6bc4dc81c77c 2024-10-14 1809364
Each commit was tested as the input for a NixOS VM with a fixed nix.package = pkgs.nixVersions.nix_2_24; always building with nix build -L --rebuild github:nixos/nixpkgs/nixos-unstable#pgadmin4
With the broken commit, this stalls. With the good commit this continues on and builds. Since the derivation to build is fixed (and so are all the inputs e.g. yarn or node), this obviously has something to do with the build environment. And this changed between those commits.
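Since a stalled build never returns, a semi-automated bisection needs a timeout to tell "stalled" apart from "built". A minimal sketch, assuming the nix build client itself can still be killed on timeout (the unkillable processes reported in this thread are the builder's children, so this is an assumption, not something verified here). The function name and the timeout value are hypothetical.

```python
import subprocess
import sys


def classify_build(cmd: list[str], timeout_s: float) -> str:
    """Run cmd and report 'built', 'failed', or 'stalled' (timeout hit)."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising; if the child were
        # itself stuck in uninterruptible sleep, this would block as well.
        return "stalled"
    return "built" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Command taken from this thread; the 30 minute timeout is a guess.
    cmd = ["nix", "build", "-L", "--rebuild",
           "github:nixos/nixpkgs/nixos-unstable#pgadmin4"]
    print(classify_build(cmd, timeout_s=30 * 60))
    sys.exit(0)
```

The returned string could be mapped to good/bad for each candidate commit between the two listed above.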
I haven't found an easy culprit with git diff yet.
On my system, manually (as in: typing it into my terminal) running npm ci in a repo also hangs the npm ci process. The build is not running through nix. The process is un-sigkill-able.
My system is running on nixpkgs commit 807e9154dcb16384b1b765ebe9cd2bba2ac287fd.
Edit: Steps to reproduce (at least on my machine):
1. cd into a project that uses npm. (I don't yet know if this works on all repos or only more complicated ones.)
2. rm -r node_modules
3. npm ci. Note that this time, it completes and exits successfully, as expected.
4. npm ci immediately afterwards (may be time sensitive). Note that it appears to hang, the little spinner spinning indefinitely, without any other output.
5. The npm ci process still exists, but now in its un-SIGKILL-able state. Since the process still exists, you are not dumped back in your shell prompt either.

I kept running npm ci in different ways (but in the same repository). Roughly every second npm ci call seemed to get stuck. These patterns seemed to hold most of the time:
- npm ci, an immediate rerun seems to get stuck.
- npm ci, an immediate rerun seems to succeed.
- npm ci, a rerun after a wait of a minute or so seems to succeed or get stuck randomly.
- npm ci, a rerun after a wait of a minute or so seems to succeed.

Possibly related:
Does downgrading to npm 10.3.0 work for you?
Bun and Deno seem to not be affected.
@Garmelon I think this is an unrelated bug. What I described here is a bug in a build process from nix, which always uses the same node and yarn version and fails or succeeds depending on the host machine's NixOS version. This is why I suspect nix to be involved.
@donovanglover I cannot rule out a random hang in the build process. But as of now it consistently works or consistently fails depending on the commit of the build machine.
Describe the bug
Currently yarn install hangs at the step linking dependencies...
Steps To Reproduce
Steps to reproduce the behavior:
1. Try to build pgadmin4 on master
2. Wait for linking dependencies...
or just run
nix build github:nixos/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2#pgadmin --rebuild
Expected behavior
yarn install should continue with the install process
Additional context
I've noticed this issue on an unrelated small bugfix in pgadmin4 which caused a rebuild, which did not work (#353092). Ofborg worked just fine, which is why I merged this small fix, but the package never did build on my system. Neither does it currently on hydra (see e.g. https://hydra.nixos.org/build/277185860/nixlog/1).
I'm not sure what changed, since nothing substantial changed in the package. I've also tried to re-run the update script, which resulted in exactly the same yarn.lock.
Running strace or lsof did not result in any trace of the issue.
Also, interestingly, running --check on an older nixos-unstable pgadmin4 derivation fails to build at the same step.
Is there anything in the nix builder that changed sandbox or build behavior and stalled yarn? I've looked at https://github.com/NixOS/nix/pull/10312 which changed stuff related to the sandbox and found an old unpatched nix version in 24.05 (which is running nix version 2.18.2 which according to https://github.com/NixOS/nix/security/advisories/GHSA-q82p-44mg-mgh5 hasn't been fixed, yet) and it does compile the current pgadmin4 just fine!
This does not work with a patched nix version (doesn't matter whether it's 2.18.4 or newer).
So the patch to fix the build-dir seems to have broken at least pgadmin.
Notify maintainers
@roberth