NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1
12.75k stars 1.52k forks source link

Help debugging randomly hanging processes while pulling paths from an s3 binary cache #2560

Closed AmineChikhaoui closed 3 years ago

AmineChikhaoui commented 5 years ago

I've got a bit of an annoying issue that starting happening recently, some nix processes will hang forever while pulling objects from S3 (nix ls-store, nix-store -r), I've collected some stacktraces etc but couldn't get to the root cause of this behavior. Example of a process that shows the issue:

This is a nix-store --serve --write in a hydra builder:

$ eu-stack -p 6550 -s
PID 6550 - process
TID 6550:
#0  0x00007fef443db473 __select
#1  0x00007fef45683b40 nix::Worker::waitForInput()
#2  0x00007fef456862ef nix::Worker::run(std::set<std::shared_ptr<nix::Goal>, nix::CompareGoalPtrs, std::allocator<std::shared_ptr<nix::Goal> > > const&)
#3  0x00007fef456867fd nix::LocalStore::buildPaths(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, nix::BuildMode)
#4  0x000000000041093f opServe(std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::__cxx11::list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)
#5  0x000000000040bb4a std::_Function_handler<void (), main::{lambda()#1}>::_M_invoke(std::_Any_data const&)
#6  0x00007fef459ed1a2 nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>)
#7  0x0000000000407457 main
#8  0x00007fef4431a7e0 __libc_start_main
#9  0x000000000040767a _start
TID 6570:
#0  0x00007fef446ab4c7 do_sigwait
#1  0x00007fef446ab53d sigwait
#2  0x00007fef453cec55 nix::signalHandlerThread(__sigset_t)
#3  0x00007fef453d476d std::thread::_State_impl<std::_Bind_simple<void (*(__sigset_t))(__sigset_t)> >::_M_run()
#4  0x00007fef44e986df
#5  0x00007fef446a1244 start_thread
#6  0x00007fef443e2ccf __clone
TID 6586:
#0  0x00007fef443d95cd poll
#1  0x00007fef439748e5 Curl_poll
#2  0x00007fef4396ecee curl_multi_wait
#3  0x00007fef456dc9c1 nix::CurlDownloader::workerThreadMain()
#4  0x00007fef456de68c nix::CurlDownloader::workerThreadEntry()
#5  0x00007fef44e986df
#6  0x00007fef446a1244 start_thread
#7  0x00007fef443e2ccf __clone
TID 6694:
#0  0x00007fef443d95cd poll
#1  0x00007fef439748e5 Curl_poll
#2  0x00007fef4396ecee curl_multi_wait
#3  0x00007fef43968cef curl_easy_perform
#4  0x00007fef430e0caa Aws::Http::CurlHttpClient::MakeRequest(Aws::Http::HttpRequest&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const
#5  0x00007fef430af10f Aws::Client::AWSClient::AttemptOneRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod) const
#6  0x00007fef430af978 Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod) const
#7  0x00007fef430b1e22 Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod) const
#8  0x00007fef4340e802 Aws::S3::S3Client::GetObject(Aws::S3::Model::GetObjectRequest const&) const
#9  0x00007fef45769d56 nix::S3Helper::getObject(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#10 0x00007fef45770536 std::_Function_handler<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (), nix::S3BinaryCacheStoreImpl::getFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >)>, std::function<void (std::__exception_ptr::exception_ptr)>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
#11 0x00007fef4570f367 void nix::sync2async<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >(std::function<void (std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >)> const&, std::function<void (std::__exception_ptr::exception_ptr)> const&, std::function<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > ()> const&)
#12 0x00007fef4576c15b nix::S3BinaryCacheStoreImpl::getFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >)>, std::function<void (std::__exception_ptr::exception_ptr)>)
#13 0x00007fef45664595 nix::BinaryCacheStore::getFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#14 0x00007fef45666667 nix::BinaryCacheStore::narFromPath(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::Sink&)
#15 0x00007fef457831e3 nix::copyStorePath(nix::ref<nix::Store>, nix::ref<nix::Store>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nix::RepairFlag, nix::CheckSigsFlag)
#16 0x00007fef45672d54 nix::SubstitutionGoal::tryToRun()::{lambda()#1}::operator()() const
#17 0x00007fef44e986df
#18 0x00007fef446a1244 start_thread
#19 0x00007fef443e2ccf __clone

Doing an strace of the process shows that it's continuously reading from a socket (the s3 connection):

$ strace -f -p 6550  -y
strace: Process 6550 attached with 4 threads
[pid  6586] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  6570] rt_sigtimedwait([HUP INT PIPE TERM WINCH], NULL, NULL, 8 <unfinished ...>
[pid  6694] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  6550] select(20, [19<pipe:[324346884]>], NULL, NULL, NULL <unfinished ...>
[pid  6694] <... restart_syscall resumed> ) = 1
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "X\f\7g\2\253\307'\6\356]}\310/\217\321\315\0N\244-\221\227\204\214d\263K\361\16\337\326"..., 367) = 367
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN}], 1, 713) = 1 ([{fd=16, revents=POLLIN}])
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "\27\3\3@\0", 5) = 5
[pid  6694] read(16<socket:[324338301]>, "\261^*\275.]JK\222\345^\274\224tm\214\267A\223lD\26\227\r\30Rdgz\236AE"..., 16384) = 2492
[pid  6694] read(16<socket:[324338301]>, 0x7fef3810e154, 13892) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN}], 1, 712) = 1 ([{fd=16, revents=POLLIN}])
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "\252^ud\360%\277;#$\346\213\307%\365\254^\362\253\203\373\206}\325\315\226\300\300\374)\200\16"..., 13892) = 1432
[pid  6694] read(16<socket:[324338301]>, 0x7fef3810e6ec, 12460) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN}], 1, 666) = 1 ([{fd=16, revents=POLLIN}])
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "\20)'\215\377\23O\265\270\341[8(\215x\272K\262m\264\302n4~\261\320\333\326F\230\244\212"..., 12460) = 2864
[pid  6694] read(16<socket:[324338301]>, 0x7fef3810f21c, 9596) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN}], 1, 666) = 1 ([{fd=16, revents=POLLIN}])
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "~\260\361\357\250\354z \223~\201\313kV\27\357)\5\226\324\r\0103\204v\4\252S\220\276\31\264"..., 9596) = 1432
[pid  6694] read(16<socket:[324338301]>, 0x7fef3810f7b4, 8164) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN}], 1, 620) = 1 ([{fd=16, revents=POLLIN}])
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "\376D\270%\241*g\253hQ\233\206\261\223\316\341i\226\3452\345\203J\2220\304`\371H\335\276@"..., 8164) = 2864
[pid  6694] read(16<socket:[324338301]>, 0x7fef381102e4, 5300) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN}], 1, 620) = 1 ([{fd=16, revents=POLLIN}])
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "\315x\343\214\356\23\244\241x\340\223\r\323\271\346\247a\20\0]\362\322=\253\0218[@\5\256V\t"..., 5300) = 2864
[pid  6694] read(16<socket:[324338301]>, 0x7fef38110e14, 2436) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN}], 1, 572) = 1 ([{fd=16, revents=POLLIN}])
[pid  6694] poll([{fd=16<socket:[324338301]>, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=16, revents=POLLIN|POLLRDNORM}])
[pid  6694] read(16<socket:[324338301]>, "\260\"ot1:\2740\302NTM6Xv\353\341\353\370\0S\271\206\4\314\333T\200\246\362\24m"..., 2436) = 2436
$ lsof -p 6550  | grep 16u
nix-store 6550 root   16u  IPv4 324338301      0t0       TCP bench-004:34152->s3-1.amazonaws.com:https (ESTABLISHED)

The store path isn't huge, running another nix-store -r <path> of the same store path worked fine on the same machine.

I'm out of ideas on how to debug this further ¯_(ツ)_/

AmineChikhaoui commented 5 years ago

Ok it finished downloading just after I created the issue :D But it took about ~30minutes while it should be much faster.

stale[bot] commented 3 years ago

I marked this as stale due to inactivity. → More info