NixOS / hydra

Hydra, the Nix-based continuous build system

hydra-queue-runner gets stuck while there are items in the queue #366

Open domenkozar opened 8 years ago

domenkozar commented 8 years ago
(gdb) bt
#0  0x00007f31a596940d in poll () from /nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f31a64db2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f31a64dc4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f31a64e0246 in pqxx::connection_base::await_notification(long, long) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f31a7771802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from /nix/store/r3blahkd7z58xsdbgdyl5hbfgaqn0var-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

Happy to provide more information while the server is still stuck, but I'll have to restart it today or tomorrow.

It seems to me that it's waiting on PostgreSQL?

cc @edolstra

domenkozar commented 8 years ago

Ah, it got interrupted after a day:

Jul 25 11:36:53 eiger hydra-queue-runner[30255]: got notification: new builds added to the queue
Jul 25 11:36:53 eiger hydra-queue-runner[30255]: checking the queue for builds > 168416...

I think this doesn't happen often on hydra.nixos.org since new builds arrive all the time, but something is keeping the Hydra queue runner from draining the queue.

domenkozar commented 8 years ago

Note: I'm running commit https://github.com/NixOS/hydra/commit/dc790c5f7eacda1c819ae222bf87674781ae1124

domenkozar commented 8 years ago

Happened again:

#0  0x00007f232e88840d in poll () from /nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f232f3fa2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f232f3fb4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f232f3ff246 in pqxx::connection_base::await_notification(long, long) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f2330690802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) () from /nix/store/r3blahkd7z58xsdbgdyl5hbfgaqn0var-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

domenkozar commented 8 years ago

@edolstra any ideas what's wrong here?

domenkozar commented 7 years ago

Again:

(gdb) bt
#0  0x00007f43ebccb40d in poll ()
   from target:/nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f43ec83d2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f43ec83e4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f43ec842246 in pqxx::connection_base::await_notification(long, long) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f43edad3802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from target:/nix/store/7isk02splr10h3a3q9a2nq7grfqymynv-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

domenkozar commented 7 years ago

Note: I'm using systemctl restart hydra-queue-runner as a workaround to get the queue processing again.

mbrock commented 6 years ago

I also have a habit of restarting the queue runner. There's some lockup almost every other time I submit a new evaluation.

LisannaAtHome commented 6 years ago

Also running into this (I have https://github.com/NixOS/hydra/commit/dc5e0b120aac06f1c46bcf0296d14775d72c2354), and kicking hydra-queue-runner works.

This seems like a critical bug for general Hydra usability, and it will definitely turn off non-experts. I'm at least documenting this on the wiki, since it seems like everyone with a private Hydra instance deals with it.

LisannaAtHome commented 6 years ago

I've noticed this also breaks declarative jobsets. If hydra-queue-runner gets into its stuck state, the list of jobsets will not update without restarting hydra-queue-runner, even if the .jobsets jobset has been evaluated and even built recently.

I now have a cron job that checks hydra-queue-runner every minute and kicks it if there are >0 jobs queued and 0 jobs running, or if there are 0 jobs queued and 0 jobs running (to handle the declarative-jobsets case); see the sketch below. It works well, but it's tricky to implement, since you have to query the Hydra JSON API to get the number of running jobs and scrape an HTML page to get the number of queued jobs. You don't want to restart hydra-queue-runner while long-running jobs are in flight, because they will get canceled and have to start from the beginning.
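
Roughly, it looks like the sketch below (rewritten in Python for readability; the endpoint paths, JSON shape, and HTML regex are illustrative guesses rather than Hydra's documented API, so adapt them to your instance):

#!/usr/bin/env python3
# Watchdog sketch for the scheme described above. The endpoint paths,
# JSON shape, and HTML regex are guesses, not Hydra's documented API.
import json
import re
import subprocess
import urllib.request

HYDRA = "http://localhost:3000"

def jobs_running():
    # Guess: asking the status page for JSON yields a list of running builds.
    req = urllib.request.Request(HYDRA + "/status",
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return len(json.load(resp))

def jobs_queued():
    # The queue length comes from scraping the queue page's HTML;
    # this regex is a guess at its wording.
    with urllib.request.urlopen(HYDRA + "/queue", timeout=30) as resp:
        html = resp.read().decode("utf-8", "replace")
    m = re.search(r"(\d+)\s+builds?", html)
    return int(m.group(1)) if m else 0

queued, running = jobs_queued(), jobs_running()
# Restart only when nothing is running, so long builds are never canceled:
# either builds are stuck in the queue (>0 queued, 0 running), or the queue
# is idle (0 queued, 0 running -- the declarative-jobsets case).
if (queued > 0 and running == 0) or (queued == 0 and running == 0):
    subprocess.run(["systemctl", "restart", "hydra-queue-runner"], check=True)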

LisannaAtHome commented 5 years ago

Duplicate issue https://github.com/NixOS/hydra/issues/550

Applying https://github.com/NixOS/hydra/commit/73ca325d1c0f7914640a63764c9a6d448fde5bd0 seems to have fixed this issue for me.

tbenst commented 4 years ago

This just happened to me on a new Hydra deploy. The first three builds worked fine. I canceled the fourth build partway through, and now the fifth build is queued and won't run. systemctl restart hydra-queue-runner did not fix it, unfortunately.

tbenst commented 4 years ago

Ok, in my case the problem seems to result from interrupting a build before it finished. I interrupted a job during the build of cuda_10.1, and every time I restart, that path seems to be involved in the hang:

> journalctl -u hydra-queue-runner.service
Dec 23 18:11:47 hydra-server hydra-queue-runner[15353]: queue monitor: std::bad_alloc
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: checking the queue for builds > 0...
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: loading build 5 (tyler:babelfish:babelfish)
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying output ‘/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run’ of ‘/nix/store/zgzivax9qf558rw4wkrcmjfbd09gpmpc-cuda_10.1.243_418.87.00_linux.run.drv’ from local store
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying 1 paths...
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying path '/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run' to 'file:///var/lib/hydra/cache'...

ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run is never copied to /var/lib/hydra/cache/*. I tried nix-collect-garbage, but the .run file remains, and I cannot rm the file because the store is a read-only filesystem. Any ideas how to fix this broken state? I'm a bit surprised that I could bork the system by cancelling a build.
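
One more thing that might be worth trying (untested; nix-store --delete refuses to remove a path that is still referenced, which may well be the case here):

> nix-store --delete /nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run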

Also, let me know if I should make a new issue, as I'm no longer sure if this is the same as @domenkozar's original post.

Edit: just realized that the .run file is the URL for src = fetchurl ..., so it makes sense that it's in the Nix store. It's a 2.4 GB file, but I'm not sure why the copying path command never finishes.

tomberek commented 4 years ago

Not sure if related, but I've noticed that large derivations (several GiB) take a long time to copy, either to the local store at the conclusion of a build (like a fetch) or to a remote store. My rough guess is that there is some quadratic performance characteristic.

If that is the case, it usually sits for a while with no output and a single Nix thread spinning at 100%. Perhaps this happened and a 15-minute timeout was reached?

tbenst commented 4 years ago

@tomberek interesting, thanks! That helped me fix it. It looks like the queue monitor: std::bad_alloc consistently happens 1-3 minutes after copying path [...].run begins. I bumped up to a t2.large (8 GB RAM) and got the bad_alloc error after 17 minutes. Now bumped to a t2.xlarge (16 GB RAM) and the copy command finally finished:

Jan 02 20:33:22 hydra-server hydra-queue-runner[1120]: copying path '/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run' (2572375416 bytes, compressed 1.4% in 1168394 ms) to binary cache

19 minutes (2572375416 bytes in 1168394 ms is roughly 2.2 MB/s) and 16 GB of RAM is crazy for copying a 2.4 GB file, but hey, it works now. Is this performance problem a Nix or a Hydra problem?

Ironically, it's faster and more memory-efficient to grab the .run file from the remote than from the local store!! From the remote, I could build on a box with only 4 GB of RAM.

Edit: my point about time may be unfair; it's possible this was slow due to EBS throttling by AWS. I think the memory question is still valid, though, and it'd be good to know where I should file the issue.

tomberek commented 4 years ago

This is a Nix problem, but I've been unable to discover more details.

tomberek commented 4 years ago

Upon more investigation, it seems this may be due to the compression. The cost is normally hidden by other work and parallelized across smaller derivations/folders/files; large files make it more apparent. Using s3://bucket?parallel-compression=true can alleviate the situation. I'm looking hopefully at this: https://github.com/NixOS/nix/pull/2276
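
In hydra.conf that would look something like the line below (the bucket name is a placeholder; store_uri is the store the queue runner copies build outputs to):

store_uri = s3://example-hydra-cache?parallel-compression=true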

xbreak commented 4 years ago

We had frequent problems with hydra-queue-runner not processing the build queue. But a while ago I added a 5-minute maximum wait to await_notification, and we have not experienced the issue since *knocks on wood*.

I've seen at least one case where there was work pending but the notification did not trigger (I know this from logging some extra info). Granted, it is possible, but unlikely, that the notification was scheduled to fire an instant after await_notification timed out.

So maybe it makes sense to add a configuration option that allows the user to set the maximum wait time? That seems better than restarting the queue runner from a cron job.

diff --git a/src/hydra-queue-runner/queue-monitor.cc b/src/hydra-queue-runner/queue-monitor.cc
--- a/src/hydra-queue-runner/queue-monitor.cc
+++ b/src/hydra-queue-runner/queue-monitor.cc
@@ -42,27 +42,37 @@ void State::queueMonitorLoop()

         /* Sleep until we get notification from the database about an
            event. */
         if (done) {
-            conn->await_notification();
+            conn->await_notification(5*60, 0);
             nrQueueWakeups++;
         } else
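
(For reference, the two long arguments to pqxx::connection_base::await_notification(long, long), visible in the backtraces above, are seconds and microseconds, so 5*60, 0 caps the wait at five minutes.)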

Update:

No further cases of the queue runner getting stuck (Jun 26th to Aug 6th), so this has solved the problem for us.

FlorianFranzen commented 3 years ago

The NixOS wiki points to this commit as a possible fix; it is part of #597 and has been reported by some users to work.

I would love to see this fixed, so: would this commit be a generally accepted way to fix this issue?

ivan commented 1 year ago

I applied the patch in https://github.com/NixOS/hydra/issues/366#issuecomment-650126996 five months ago, and since then I haven't needed to restart Hydra anywhere near as often.