NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1
12.79k stars 1.52k forks source link

nix-channel --update hangs indefinitely #3236

Open chkno opened 4 years ago

chkno commented 4 years ago

I have a nix-channel --update that's been stuck for three days.

$ ps ax | grep nix-channel
11439 ?        Sl     0:29 /nix/store/dj2ylmvx2c288h51yx8ljg4h6bh5vyla-nix-2.3/bin/nix-channel --update

$ beenrunning() { units -v1 "($(cut -d" " -f1 /proc/uptime) - $(cut -d" " -f22 /proc/$1/stat)/100)s" days; }
$ beenrunning 11439
        (347589.62 - 8627974/100)s = 3.0244199 days

It appears to have printed this before hanging:

waiting for lock on '/root/.cache/nix/tarballs/0s9swzqhys99sk5ndz23z2w13mbw9wd14ff58nfxyhss7a986jn9-file'...

Stack trace of the main thread (full trace of all threads):

#0  in syscall () from ...glibc-2.27/lib/libc.so.6
#1  in std::__atomic_futex_unsigned_base::_M_futex_wait_until(...) () from ...gcc-8.3.0-lib/lib/libstdc++.so.6
#2  in nix::Downloader::download(nix::DownloadRequest const&) () from ...nix-2.3/lib/libnixstore.so
#3  in nix::Downloader::downloadCached(nix::ref<nix::Store>, nix::CachedDownloadRequest const&) () from ...nix-2.3/lib/libnixstore.so
#4  in _main(int, char**) ()
#5  in nix::mainWrapped(int, char**) ()
#6  in nix::handleExceptions(string, function) () from ...nix-2.3/lib/libnixmain.so
#7  in main ()

This is the only process that has that file open, and it has the lock (the "W" for write-lock in "9uW"):

# lsof -n | grep 0s9swzqhys99sk5ndz23z2w13mbw9wd14ff58nfxyhss7a986jn9-file.lock
  nix-chann 11439                  root  9uW  REG  254,2  0  59779377 /root/.cache/nix/tarballs/0s9...-file.lock
  nix-chann 11439 11440 nix-chann  root  9uW  REG  254,2  0  59779377 /root/.cache/nix/tarballs/0s9...-file.lock
  nix-chann 11439 11441 nix-chann  root  9uW  REG  254,2  0  59779377 /root/.cache/nix/tarballs/0s9...-file.lock
  nix-chann 11439 11442 nix-chann  root  9uW  REG  254,2  0  59779377 /root/.cache/nix/tarballs/0s9...-file.lock
  nix-chann 11439 11443 nix-chann  root  9uW  REG  254,2  0  59779377 /root/.cache/nix/tarballs/0s9...-file.lock
  nix-chann 11439 11444 nix-chann  root  9uW  REG  254,2  0  59779377 /root/.cache/nix/tarballs/0s9...-file.lock

This stuck process blocks other invocations of nix-channel --update as well, causing them to emit waiting for lock on '/root/.cache/nix/tarballs/...-file'... and then hang indefinitely, but with a different backtrace:

(gdb) bt
#0 in flock () from ...-glibc-2.27/lib/libc.so.6
#1 in nix::lockFile(int, nix::LockType, bool) () from ...nix-2.3/lib/libnixstore.so
#2 in nix::PathLocks::lockPaths(set<string>, string, bool) () from ...-nix-2.3/lib/libnixstore.so
#3 in nix::PathLocks::PathLocks(set<string>, string) () from ...nix-2.3/lib/libnixstore.so
#4 in nix::Downloader::downloadCached(nix::ref<nix::Store>, nix::CachedDownloadRequest const&) () from ...-nix-2.3/lib/libnixstore.so
#5 in _main(int, char**) ()
#6 in nix::mainWrapped(int, char**) ()
#7 in nix::handleExceptions(string, function) () from ...nix-2.3/lib/libnixmain.so
#8 in main ()

The list of channels it was trying to fetch:

# nix-channel --list
nixos https://nixos.org/channels/nixos-19.09
nixos-hardware https://github.com/NixOS/nixos-hardware/archive/master.tar.gz
nixos-unstable https://nixos.org/channels/nixos-unstable

It looks like the one that got stuck is the 19.09 channel metadata:

# ls -l /root/.cache/nix/tarballs/0s9swzqhys99sk5ndz23z2w13mbw9wd14ff58nfxyhss7a986jn9-file*
lrwxrwxrwx 1 root root 55 Nov 18 16:04 /root/.cache/nix/tarballs/0s9swzqhys99sk5ndz23z2w13mbw9wd14ff58nfxyhss7a986jn9-file -> /nix/store/l67ybb9qn0y2l77v4j8jc4q92mb9mbmh-nixos-19.09
-rw------- 1 root root  0 Nov  4 10:32 /root/.cache/nix/tarballs/0s9swzqhys99sk5ndz23z2w13mbw9wd14ff58nfxyhss7a986jn9-file.lock

The file looks like it was successfully fetched. It's an html file with this inside:

$ lynx -dump -force-html -nolist -width=140 /nix/store/l67ybb9qn0y2l77v4j8jc4q92mb9mbmh-nixos-19.09
                                              nixos-19.09 release nixos-19.09.1282.e6d584f6dd2

   Released on 2019-11-18 18:59:01 from Git commit e6d584f6dd22b587d5cdf5019f5e7dd2be370f61 via Hydra evaluation 1555352.

                          File name                           Size                              SHA-256 hash
   binary-cache-url                                                23 eb8d2a51d85521c07092ed8126df19bb6f9d7c16a5275395a0e110868a4ac6c0
   git-revision                                                    40 968416ccade5c065ab3f5e29078537b3aa75eb2d1ab999c592d042e38f39bc07
   nixexprs.tar.xz                                           13867856 611606b1c5a7824687b090b42094f6f43386aed4f6016f575146e5a01cf6b338
   nixos-19.09.1282.e6d584f6dd2-x86_64-linux.ova           1412923392 f19e98a088a0eaee85de018ab6b6e629b9fdf8d7a64bfa935ed0b8a18764bd18
   nixos-graphical-19.09.1282.e6d584f6dd2-x86_64-linux.iso 1343225856 bea34e1c59986967e979ebc981fd4b71263be6ec5de98ad4f89d6e493144b7ad
   nixos-minimal-19.09.1282.e6d584f6dd2-i686-linux.iso      558891008 225910221b2506588d6ee7eee118143fd76d6f46b0f72345f27deb4cd1b1d260
   nixos-minimal-19.09.1282.e6d584f6dd2-x86_64-linux.iso    564133888 c4358112ba93c4544119770363a955f401a3e6a4194b91920950d9b1c70ce257
   src-url                                                         36 ab86d9d20e41a772ee3a08972f0f834fbf7186a10e34cab5417fab973901e5fe
   store-paths.xz                                             1068860 092492469016adff8958e26e6674e2db378eb0d388459ca561fb6791657cfee5
ktor commented 4 years ago

+1, having the same issue right now.

chkno commented 4 years ago

Any other info I should extract from this stuck process before killing it?

$ beenrunning 11439
        (1272390.24 - 8627974/100)s = 13.728131 days

While this is a minor problem for interactive use where an impatient user can hit ^C, it's a more serious problem for mechanisms like system.autoUpgrade.enable where this happens in the background unnoticed and updates just stop indefinitely.

ktor commented 4 years ago

@chkno do you know how to workaround the problem and get the upgrade to work again? For me it's a blocker, I cannot do nixos-rebuild --upgrade switch because of that.

chkno commented 4 years ago

After running kill 11439 (the PID of the stuck nix-channel --update) I was able to update channels again.

Upon being killed, the process that had been stuck for two weeks printed this message (the same message it prints if you run it and immediately press control-C):

warning: warning: download of 'https://nixos.org/channels/nixos-19.09' was interrupted; using cached result
error: interrupted by the user
ktor commented 4 years ago

I've removed file it was waiting for: /root/.cache/nix/tarballs/0s9swzqhys99sk5ndz23z2w13mbw9wd14ff58nfxyhss7a986jn9-file

Restarted and it works. I have no idea if restarting or removing the file has worked.

timor commented 4 years ago

Same problem here, already happened twice, preventing automatic updates without noticing.

timor commented 4 years ago

So far we encountered this problem on two machines, which both had in common that they ran on BTRFS, while our other machines run on different file systems.

@ktor, @chkno, may I ask which file system you are using?

ktor commented 4 years ago

@timor ext4

chkno commented 4 years ago

ext4

timor commented 4 years ago

Ok, thanks.

ktor commented 4 years ago

It helps to kill a nix-channel --update nixos process.

timor commented 4 years ago

Right now I think the reason for this problem on my machine is network-related. It seems to happen when I lose/don't have network connection.

chkno commented 4 years ago

Possibly related: #3338 . Possible workaround: set the nix option download-attempts=1

ktor commented 4 years ago

We'll see if it'll be fixed with curl 7.67+ as Elco suggested in #3338

stale[bot] commented 3 years ago

I marked this as stale due to inactivity. → More info

tobiasBora commented 3 years ago

Any idea how to solve that issue? During a nixos-rebuild I get a similar error:

waiting for lock on '/nix/store/h8il05l5iazsvb2cahk3iybq3iyg1ja0-scout-runtime-0.20210527.0.tar.gz'...

I tried to kill process, it's not working. And I can't remove/chmod this file:

sudo rm -rf /nix/store/h8il05l5iazsvb2cahk3iybq3iyg1ja0-scout-runtime-0.20210527.0.tar.gz.lock 
rm: cannot remove '/nix/store/h8il05l5iazsvb2cahk3iybq3iyg1ja0-scout-runtime-0.20210527.0.tar.gz.lock': Read-only file system

I'd also love to avoid a nix-collect-garbage because I spent already a day downloading stuff...

EDIT: the error magically vanished after a failed update...

afreakk commented 2 years ago

Any idea how to solve that issue? During a nixos-rebuild I get a similar error:

waiting for lock on '/nix/store/h8il05l5iazsvb2cahk3iybq3iyg1ja0-scout-runtime-0.20210527.0.tar.gz'...

I tried to kill process, it's not working. And I can't remove/chmod this file:

sudo rm -rf /nix/store/h8il05l5iazsvb2cahk3iybq3iyg1ja0-scout-runtime-0.20210527.0.tar.gz.lock 
rm: cannot remove '/nix/store/h8il05l5iazsvb2cahk3iybq3iyg1ja0-scout-runtime-0.20210527.0.tar.gz.lock': Read-only file system

I'd also love to avoid a nix-collect-garbage because I spent already a day downloading stuff...

EDIT: the error magically vanished after a failed update...

I had similar issue, but after a reboot (dont know if reboot was needed) and doing:

❯ nix-store --delete /nix/store/srxygvvwgpm5w7lp7kpjqg4l2n7v814p-unit-nix-daemon.service.lock
finding garbage collector roots...
deleting '/nix/store/srxygvvwgpm5w7lp7kpjqg4l2n7v814p-unit-nix-daemon.service.lock'
deleting unused links...
note: currently hard linking saves 36985.18 MiB
1 store paths deleted, 0.00 MiB freed

with the lockfiles, they were removed.

jsoo1 commented 2 years ago

I think I am seeing something similar but in the https://github.com/nix-community/nix-eval-jobs repository. Downloads seem to hang when using fetch in primops.cc. There is one qualifying factor here, which is that we are still using nix@2.3. I tracked down the issue in gdb. It seems as though the enqueuing thread properly mutates the download thread's incoming queue but the downloading thread never sees the queue update. Specifically the enqueuing thread pushes the DownloadItem onto the queue in nix::CurlDownloader::enqueueItem and nix::Downloader::workerThreadMain never sees the item in the queue.

I am mentioning it because I am a little at a loss as to why that would be.

jsoo1 commented 2 years ago

I think the hang I was seeing in nix-eval-jobs was my misunderstanding the address space of the forks. I don't think the hang is the same as this one, now. Just the symptoms.

hydrojava commented 1 year ago

yes