NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1
12.06k stars 1.47k forks source link

Deadlock with remote builders #2260

Open christian-marie opened 6 years ago

christian-marie commented 6 years ago

I've been running a Hydra server for a few months now, and it's been running fine with a single x86_64-linux remote build slave. After an indeterminate period, remote building starts hanging until timeout on the hydra-queue-runner side. The slave node will just sit there with N nix-store --serve --write commands, where N = the number of concurrent builds configured.

The breakage doesn't seem to be related to any configuration change, as a reprovision of the hydra-master and slave nodes has fixed this issue twice now. Just rebuilding the slave doesn't. I'm getting sick of reprovisioning the build server, so now I'm digging.

The hydra-master machine is stuck in this state now, and I have narrowed reproducing this issue to triggering any remote build whatsoever in whatever state the server has ended up in.

I found via strace that the hydra-master build would wait on a select syscall (waiting for children here: https://github.com/NixOS/nix/blob/master/src/libstore/build.cc#L4170), whilst the nix-store --serve --write would wait on a read(0,..)

Now the weird bit: that blocking read appears to happen here: https://github.com/NixOS/nix/blob/master/src/nix-store/nix-store.cc#L790

I came to that conclusion by the uniqueness of readNum in that function, as per the stack trace below.

I don't know how this is meant to work, or if my analysis is correct, but it seems like some kind of protocol mismatch? I'm now building nix from source to test any suggestions. I should be ready for that in around 12 hours. I'll probably timebox finding a solution to this to this time tomorrow, and just rebuild again otherwise.

Perhaps you make something else of these stack traces:

hydra-slave1 nix-store --serve --write, blocked on read syscall of stdin:

(gdb) bt
#0  0x00007ff57e029b1d in read () from target:/nix/store/2kcrj1ksd2a14bm5sky182fv2xwfhfap-glibc-2.26-131/lib/libpthread.so.0
#1  0x00007ff57ed7d67a in nix::FdSource::readUnbuffered(unsigned char*, unsigned long) () from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixutil.so
#2  0x00007ff57ed7ca2d in nix::BufferedSource::read(unsigned char*, unsigned long) () from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixutil.so
#3  0x00007ff57ed7cf98 in nix::Source::operator()(unsigned char*, unsigned long) () from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixutil.so
#4  0x000000000041948b in unsigned int nix::readNum<unsigned int>(nix::Source&) ()
#5  0x000000000040f4e0 in opServe(std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) ()
#6  0x000000000040bfba in std::_Function_handler<void (), main::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#7  0x00007ff57f39fa7d in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixmain.so
#8  0x000000000040733a in main ()
(gdb) bt
#0  0x00007ff57e029b1d in read () from target:/nix/store/2kcrj1ksd2a14bm5sky182fv2xwfhfap-glibc-2.26-131/lib/libpthread.so.0
#1  0x00007ff57ed7d67a in nix::FdSource::readUnbuffered(unsigned char*, unsigned long) () from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixutil.so
#2  0x00007ff57ed7ca2d in nix::BufferedSource::read(unsigned char*, unsigned long) () from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixutil.so
#3  0x00007ff57ed7cf98 in nix::Source::operator()(unsigned char*, unsigned long) () from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixutil.so
#4  0x000000000041948b in unsigned int nix::readNum<unsigned int>(nix::Source&) ()
#5  0x000000000040f4e0 in opServe(std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) ()
#6  0x000000000040bfba in std::_Function_handler<void (), main::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#7  0x00007ff57f39fa7d in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from target:/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixmain.so
#8  0x000000000040733a in main ()

hydra-master nix-env process, blocked on select:

(gdb) bt
#0  0x00007ffff62475c0 in select () from /nix/store/2kcrj1ksd2a14bm5sky182fv2xwfhfap-glibc-2.26-131/lib/libc.so.6
#1  0x00007ffff7312720 in nix::Worker::waitForInput() () from /nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixstore.so
#2  0x00007ffff73141c8 in nix::Worker::run(std::set<std::shared_ptr<nix::Goal>, nix::CompareGoalPtrs, std::allocator<std::shared_ptr<nix::Goal> > > const&) () from /nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixstore.so
#3  0x00007ffff7314553 in nix::LocalStore::buildPaths(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, nix::BuildMode) () from /nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixstore.so
#4  0x000000000042489f in nix::createUserEnv(nix::EvalState&, std::__cxx11::list<nix::DrvInfo, traceable_allocator<nix::DrvInfo> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#5  0x00000000004196f7 in opInstall(Globals&, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) ()
#6  0x000000000040c32e in main::{lambda()#1}::operator()() const ()
#7  0x00007ffff7679a7d in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from /nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/lib/libnixmain.so
#8  0x000000000040999a in main ()
(gdb) run -vvvvv -iA nixos.hello --option build-use-substitutes false

And finally, the tail of debug output from nix-env:

starting build hook '/nix/store/j1mjqwvlhfmhpj3nrzap4zb221s2ciqw-nix-2.0.4/libexec/nix/build-remote'
got 1 remote builders
considering building on remote machine 'ssh://root@hydra-slave1'
connecting to 'ssh://root@hydra-slave1'...
hook reply is 'accept'
building '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv' on 'ssh://root@hydra-slave1'...
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 172 bytes
waiting for the upload lock to 'ssh://root@hydra-slave1'...
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 122 bytes
copying dependencies to 'ssh://root@hydra-slave1'...
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 90 bytes
copying 1 paths...
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 224 bytes
starting pool of 3 threads
querying remote host 'root@hydra-slave1' for info on '/nix/store/3x7dwzq014bblazs7kq20p9hyzz0qh8g-hello-2.10.tar.gz'
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 153 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 1260 bytes
copying path '/nix/store/3x7dwzq014bblazs7kq20p9hyzz0qh8g-hello-2.10.tar.gz' to 'ssh://root@hydra-slave1'...
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 243 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 84 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 85 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 255 bytes
waiting for children
building of '/nix/store/2mcs5m10gn22b06m4xprsvvw8c1pg599-hello-2.10.drv': read 406 bytes
adding path '/nix/store/3x7dwzq014bblazs7kq20p9hyzz0qh8g-hello-2.10.tar.gz' to remote host 'root@hydra-slave1'
waiting for children
domenkozar commented 6 years ago

Similar: https://github.com/NixOS/hydra/issues/366

HackerFoo commented 4 years ago

I ran into something similar after running out of disk space, and clearing /nix/var/nix/current-load/ on the local machine seems to fix it.

srghma commented 4 years ago

@HackerFoo :+1: had this problem too because my /boot ran out of space

stale[bot] commented 3 years ago

I marked this as stale due to inactivity. → More info