elaforge opened this issue 5 years ago
I marked this as stale due to inactivity.
Still relevant
I have run into this issue multiple times, seemingly at random, in my own CI. Here is an example where it hung for over 25 minutes until I manually cancelled it; strangely, when I did hit the cancel button the job exited successfully, which was a surprise. https://github.com/nrdxp/nrdos/actions/runs/3357138657/jobs/5562747027
I did another run right after to try to get more information by slapping a --debug on the nix copy command. Alas, it did not hang on that run:
https://github.com/nrdxp/nrdos/actions/runs/3357218857/jobs/5562877628#step:3:4776
The line I link to here is usually where it hangs, though, which seems to be after an ssh connection is already made.
My suspicion, since the hung job mentioned above actually exited successfully and even shows successful uploads in the logs (that portion of the log only showed up after I cancelled it), is that there is some sort of deadlock in the parallel upload code. Just a guess though.
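Until the root cause is found, one defensive measure for CI is to bound how long the copy may run, so a hang fails fast instead of eating the whole job. This is my own sketch, not something from this thread: it assumes GNU coreutils `timeout` is on PATH, and the 30m limit and the `nix copy` invocation in the usage comment are illustrative choices.

```shell
# Run a command under a time limit. SIGINT mimics the manual cancel
# that let the hung job above finish cleanly; GNU timeout exits with
# status 124 when the limit is hit.
copy_with_timeout() {
    limit="$1"; shift
    timeout --signal=INT "$limit" "$@"
}

# Hypothetical CI usage:
#   copy_with_timeout 30m nix copy --to "ssh://cache-host" ./result
```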
I just noticed that this seems to happen frequently when a large number of derivations is going to be copied (either because you passed them, or because the ones you passed have a lot of dependencies to copy). I also noticed that if you just wait long enough, the copying actually does start.
So there are two possible explanations at this point. Does nix copy compress all the files beforehand? That would explain such a massive stall, but it doesn't seem to line up with CPU usage, which is high even during the file transfer; I originally assumed that was because it compresses on the fly. The only other explanation I can think of is that nix copy is making a ton of roundtrips while determining the dependency closure. Based on the behavior I observed, this seems to be the most likely answer.
Update: It is definitely the number of queries being made. The stall was happening in a non-interactive environment, so there was no output, but if I run the same command locally I can see Nix making a ton of queries to the cache, causing it to stall for quite some time.
NIX_SSHOPTS="-oControlMaster=no" fixes this issue as far as I can tell.
But even then, ControlPath should probably be set to a path which doesn't exist, since the control socket will still be used regardless.
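Combining the two suggestions above into one snippet. This is a sketch: `ControlPath=none` is OpenSSH's documented way to disable connection sharing outright (rather than pointing at a nonexistent path), and the `nix copy` line in the comment, including its host, is illustrative.

```shell
# Tell the ssh processes that nix spawns to neither create a control
# master nor reuse an existing mux socket.
NIX_SSHOPTS="-oControlMaster=no -oControlPath=none"
export NIX_SSHOPTS
echo "$NIX_SSHOPTS"
# then, e.g.: nix copy --to "ssh://example-host" ./result
```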
@kittywitch very interesting, can you say how you discovered this workaround? Is nix intentionally using ControlMaster for latency reasons, or accidentally picking it up from the ambient ssh config? Also, wouldn't this only apply to the ssh:// transport? I saw these timeouts from https://, so are you sure it's the same issue?
Just as an update, we still see this, but rarely.
> NIX_SSHOPTS="-oControlMaster=no" fixes this issue as far as I can tell
For me, closing my existing ssh session to the target system worked, but I'll try this next time. Almost certainly the same root cause.
In my case, I was running colmena apply and it successfully updated one system, but got stuck at "Pushing system closure" on the others while running nix-copy-closure. I tried running it manually with -vvvvv:
$ nix-copy-closure --to --include-outputs --gzip -vvvvv root@m2 /nix/store/ch3vl30026v18di49035r73xyj2dd8xf-nixos-system-m2-23.05pre-git
OpenSSH_9.3p2, OpenSSL 3.0.9 30 May 2023
debug1: Reading configuration data /home/ivan/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 5: Applying options for *
debug1: auto-mux: Trying existing master
debug1: mux_client_request_session: master session id: 3
^Ckilling process 960563
error: interrupted by the user
My stack trace looks a bit different though (possibly because this is nix-copy-closure rather than nix copy?):
$ eu-stack -p 998166
PID 998166 - process
TID 998166:
#0 0x00007fcc9597e74c read
#1 0x00007fcc95ec011f nix::readLine[abi:cxx11](int)
#2 0x00007fcc960ce121 nix::SSHMaster::startCommand(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#3 0x00007fcc960723db nix::LegacySSHStore::openConnection()
#4 0x00007fcc96075210 std::_Function_handler<nix::ref<nix::LegacySSHStore::Connection> (), nix::LegacySSHStore::ref(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
#5 0x00007fcc960796c6 nix::Pool<nix::LegacySSHStore::Connection>::get()
#6 0x00007fcc960799d5 virtual thunk to nix::LegacySSHStore::queryValidPaths(std::set<nix::StorePath, std::less<nix::StorePath>, std::allocator<nix::StorePath> > const&, nix::SubstituteFlag)
#7 0x00007fcc960e20ee nix::copyPaths(nix::Store&, nix::Store&, std::set<nix::StorePath, std::less<nix::StorePath>, std::allocator<nix::StorePath> > const&, nix::RepairFlag, nix::CheckSigsFlag, nix::SubstituteFlag)
#8 0x00007fcc960e33d7 nix::copyPaths(nix::Store&, nix::Store&, std::set<nix::RealisedPath, std::less<nix::RealisedPath>, std::allocator<nix::RealisedPath> > const&, nix::RepairFlag, nix::CheckSigsFlag, nix::SubstituteFlag)
#9 0x00007fcc960e3d64 nix::copyClosure(nix::Store&, nix::Store&, std::set<nix::RealisedPath, std::less<nix::RealisedPath>, std::allocator<nix::RealisedPath> > const&, nix::RepairFlag, nix::CheckSigsFlag, nix::SubstituteFlag)
#10 0x000055b2a40caf1e main_nix_copy_closure(int, char**) [clone .lto_priv.0]
#11 0x000055b2a413a53e nix::mainWrapped(int, char**)
#12 0x00007fcc96259fa1 nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>)
#13 0x000055b2a40993d7 main
#14 0x00007fcc958abace __libc_start_call_main
#15 0x00007fcc958abb89 __libc_start_main@@GLIBC_2.34
#16 0x000055b2a409f6c5 _start
TID 998168:
#0 0x00007fcc958c19ea __sigtimedwait
#1 0x00007fcc958c10bc sigwait
#2 0x00007fcc95ec7105 nix::signalHandlerThread(__sigset_t)
#3 0x00007fcc95ec38cc std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(__sigset_t), __sigset_t> > >::_M_run()
#4 0x00007fcc95c515c3 execute_native_thread_routine
#5 0x00007fcc9590ddd4 start_thread
#6 0x00007fcc9598f9b0 __clone3
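Given the "auto-mux: Trying existing master" lines in the log above (and the earlier report that closing an existing ssh session to the target helped), OpenSSH's standard `-O` control commands can check for and tear down the suspect control master. A sketch, using `root@m2` from the log; if no master is running or no ControlPath applies, the check simply fails and nothing is torn down.

```shell
host="root@m2"  # host from the log above
if ssh -O check "$host" 2>/dev/null; then
    # A master is alive; ask it to exit so the next connection
    # (and hence nix-copy-closure) starts fresh.
    ssh -O exit "$host"
else
    echo "no live control master for $host"
fi
```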
FWIW switching to compression=zstd fixed the issue for me. Using the default compression method consistently hung nix copy for certain larger outputs (tested up to 10 min before I interrupted). The same output compressed and uploaded in 5 sec after switching to zstd.
> FWIW switching to compression=zstd fixed the issue for me.
@simonzkl Where did you set compression=zstd? I'd like to give this a try, but I see no compression option in the NixOS options search or in the deploy-rs docs.
Where it accepts the Nix store URL, e.g. in nix copy --to <url>.
I am trying to copy the closure of one of my systems to another, and it freezes every time.
My command:
nix copy --verbose --to "ssh://<MY-MACHINE>?compress=true" .#nixosConfigurations.<MY-MACHINE>.config.system.build.toplevel
As I also have this problem quite often, I've noticed that for me it was always #5304: file locks that weren't cleaned up by previous nix commands (maybe, I suspect, because they were cancelled?).
To work around this(?) issue, I wrote a simple script to find stale file locks (not sure if it's absolutely safe to do it like that, though):
#!/bin/sh
for file in /nix/store/*; do
    test -f "${file}.lock" && echo "${file}.lock"
done
which lists all the stale file locks, and then I remove those locks via
sudo mount -o remount,rw,bind /nix/store
sudo rm <list of these locks>
Not optimal obviously...
Edit: As I'm just facing it again, it seems there are also locks for files/folders that don't exist (so they aren't caught by that script, unfortunately). Deleting them still works around the issue, but it's a little bit cumbersome...
Edit 2: Maybe to improve that workaround script, check that the *.lock files have size == 0 (I still doubt that it's safe though...)
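Folding both edits back into the script gives something like the following sketch: globbing `*.lock` directly also catches locks whose store path no longer exists, and the size-0 check is the heuristic from Edit 2. Whether removing such locks is fully safe remains an open question in this thread.

```shell
list_stale_locks() {
    # Print zero-size *.lock files under a store directory.
    # The directory is a parameter only so the sketch can be exercised
    # outside a real /nix/store; the default is the real store.
    store_dir="${1:-/nix/store}"
    for lock in "$store_dir"/*.lock; do
        [ -f "$lock" ] && [ ! -s "$lock" ] && echo "$lock"
    done
    return 0
}

# usage: list_stale_locks, then remount rw and rm the printed paths as above
```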
Here is a process that has been hung for 6 days:
elaforge 2909 0.0 0.1 514780 17684 ? Sl Jul25 0:00 /nix/store/zg66y04g2bvmw41cgrywysr86s40g5cc-nix-2.2/bin/nix copy --from https://cache.nixos.org /nix/store/jxw2sxagx9smpjklb00qzgiqgqv1zvl6-Groq.Util.Exceptions --option allowed-impure-host-deps /groq /etc/hostname --option extra-sandbox-paths /groq/models --option substituters https://cache.nixos.org http://narpile.groq --option sandbox relaxed --option require-sigs false --option fallback true --option keep-outputs true --option max-jobs 0 --option builders-use-substitutes true --option cores 0
Note that the store path doesn't exist on cache.nixos.org, and when I run it by hand I get a "path not valid" error right away. But this one hung for some reason.
eu-stack -p 2909 -s says:

It's not just for invalid paths; here's another one:
elaforge 29718 0.0 0.1 514780 17892 ? Sl Jul23 0:00 /nix/store/zg66y04g2bvmw41cgrywysr86s40g5cc-nix-2.2/bin/nix copy --from http://narpile.groq /nix/store/5naa5ppxqz233bhwms2hkh8ifvn3f1z2-silently-1.2.5-doc --option allowed-impure-host-deps /groq /etc/hostname --option extra-sandbox-paths /groq/models --option substituters http://narpile.groq --option sandbox relaxed --option require-sigs false --option fallback true --option keep-outputs true --option max-jobs 0 --option builders-use-substitutes true --option cores 0
This path exists and when I run by hand it completes immediately (I have it locally). The stack looks similar:
This is nixStable from nixpkgs 19.03 at 5c52b25283a6cccca443ffb7a358de6fe14b4a81. The OS is the GCP image ubuntu-1604-xenial-v20180306.