gearman / gearmand

http://gearman.org/

gearmand hangs intermittently when timeouts are used #301

Open infraweavers opened 3 years ago

infraweavers commented 3 years ago

Hello!

We've recently upgraded to OMD3.30, which has replaced the ancient gearmand 0.33 that shipped previously with 1.1.19. This includes mod-gearman (for naemon) and pnp_gearman_worker (for pnp4nagios). We find that periodically all our naemon checks stop responding. At the same time, when gearman_top is executed we get an error:

failed to connect to address localhost and port 4730: Interrupted system call

We replaced the build of gearmand with one with symbols in it, waited for the problem to occur (about 15-20 minutes) in our setup and found that gearmand seems to be in an infinite loop here: https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L399-L417

It looks like noop_sent is never getting incremented: it stays 0, so the if branch is never entered, the NOOP is never sent to any worker, and the loop is stuck forever.

(screenshot)

Edit: Running

(gdb) print _global_gearmand->server->worker_wakeup
$4 = 0 '\000'

So actually, it doesn't look like the noop situation is the problem; there must be something wrong with the cyclic list, such that head is no longer the loop point? We'll try and confirm.

Edit: doesn't look like it's that. We've just stepped through the loop 200 times and the list seems to loop correctly: gdb.txt

Edit: again, this is all wrong. It is leaving the loop, as we put a breakpoint after it; we think it's just that every time we gcore the process it happens to be inside that loop. Presumably there's a loop on the outside that it's actually stuck in.

Actually! We think it's this loop that's looping forever: https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131 as we put a breakpoint on https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L133 and it's not getting hit despite the program continuing for a few seconds.

infraweavers commented 3 years ago

So we're reasonably confident now (ha) that the problem is https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131 however we can't see how it would ever break out of that loop as nothing sets the job_list to NULL once it's been set. Confused :/
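
For orientation, the loop in question has roughly this shape. The sketch below is a self-contained paraphrase with stand-in types (server_job, server_worker, and queue_job are placeholders, not gearmand's real definitions); it shows why the loop only terminates if each queue call actually unlinks the head job:

    #include <cstddef>

    // Minimal model of the loop shape at worker.cc:124-131 (a paraphrase for
    // illustration; types and names here are stand-ins, not gearmand's).
    struct server_job { server_job *next; };
    struct server_worker { server_job *job_list; };

    enum gearmand_error_t { GEARMAND_SUCCESS, GEARMAND_FAIL };

    // Stand-in for gearman_server_job_queue(): on success it is expected to
    // unlink the head job from worker->job_list, advancing the loop below.
    gearmand_error_t queue_job(server_worker *worker)
    {
      worker->job_list = worker->job_list->next;  // unlink the head
      return GEARMAND_SUCCESS;
    }

    void requeue_all(server_worker *worker)
    {
      while (worker->job_list != NULL)
      {
        // If the unlink ever fails to happen, for example because the list is
        // corrupted and a node's next pointer loops back onto itself, then
        // worker->job_list never becomes NULL and this spins forever.
        if (queue_job(worker) != GEARMAND_SUCCESS)
        {
          break;  // error path in the real code fails the job back to clients
        }
      }
    }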

p-alik commented 3 years ago

Thank you for your elaborate report, @infraweavers. Do you know how the bug could be reproduced?

infraweavers commented 3 years ago

@p-alik unfortunately not; our reproduction case currently is to upgrade one of our monitoring servers, deploy the entire configuration to it, and wait/restart the services a few times. With our config applied there are about 1200 checks per minute running, which should work out to about 2400 jobs being pushed onto gearman itself.

We're trying to narrow it down to a reproducible case; however, it's difficult for us to track the exact scenario in which it happens, as it appears "random".

p-alik commented 3 years ago

Actually! We think it's this loop that's looping forever: https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131 as we put a breakpoint on https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L133 and it's not getting hit despite the program continuing for a few seconds.

Does gearmand get in the if branch? https://github.com/gearman/gearmand/blob/8f58a8b439ee206eec732c50de982b05e2d40b27/libgearman-server/worker.cc#L127-L130

infraweavers commented 3 years ago

Does gearmand get in the if branch?

We don't believe so: we see no logs for that, nor have we seen it happen when "nexting" through. I'll put a breakpoint on it and confirm now.

infraweavers commented 3 years ago

@p-alik No, it doesn't. (screenshot)

p-alik commented 3 years ago

Despite some "if nok"-branches there are only two return GEARMAND_SUCCESS in gearman_server_job_queue. Hence that's the only possible value https://github.com/gearman/gearmand/blob/8f58a8b439ee206eec732c50de982b05e2d40b27/libgearman-server/job.cc#L340

SpamapS commented 3 years ago

Wow, that's a doozy. It looks like gearmand is having trouble tearing down workers that disconnect while they have jobs assigned to them. That should work; this code is present to handle exactly that situation. But while we work this out, you may want to address it defensively by making sure your workers avoid quitting while they have a job assigned.

The state machine in gearmand is basically the hardest part to debug, so forgive me if I'm asking a bunch of random questions but I'd like to trace through a few parts of the code with some ideas in mind:

How often do your workers disconnect?

How many threads are you running with?

Do you run with retries configured? If so, do you get any logged failures to retry?

If you use foreground jobs, do you get WORK_FAIL packets sent to your clients?

Are you primarily using background or foreground jobs?

Do you have a stateful store set up for background jobs? If so which one?

Finally, if you've got a debugger attached in one of these situations, can you watch the job_list and let us know what is in it?

One suspicion I have is that the job list is somehow getting corrupted and is no longer null-terminated. TBH, I've been out of the code long enough, and I have been avoiding revisiting those macros long enough, that I'm not even sure that's how they're supposed to work. I'd love to replace them with a nice C++ std::list, because zomg this is my nightmare, but for now, if you can just verify that the list does in fact only go in one direction, ends with the _next element set to NULL, and is not looped back onto itself, that would help.
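
A minimal sketch of that check, assuming a simplified node with only a next pointer (gearmand's real structs and field names differ): Floyd's two-pointer walk reports whether the chain is NULL-terminated or loops back onto itself.

    #include <cstddef>

    struct job_node { job_node *next; };  // hypothetical, for illustration

    // Returns true if the chain starting at head ends in NULL,
    // false if it loops back onto itself.
    bool list_is_null_terminated(const job_node *head)
    {
      const job_node *slow = head;
      const job_node *fast = head;
      while (fast != NULL && fast->next != NULL)
      {
        slow = slow->next;        // one step
        fast = fast->next->next;  // two steps
        if (slow == fast)
        {
          return false;           // the pointers met, so there is a cycle
        }
      }
      return true;                // ran off the end, properly terminated
    }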

Also, I wish I could offer some of my time to dig through a core file or something, but I don't have that kind of time outside day job, and gearmand is not part of dayjob. You may want to seek out a consultant with deep C/C++ skills who can take your money on this one. ;)

sni commented 3 years ago

I am trying to give some answers for naemon's mod-gearman in general:

infraweavers commented 3 years ago

We do have a debugger attached in a VM snapshot whilst the infinite loop is ongoing; we'll try to get some useful output. Do you have a recommended way of "seeing" what you want to see?

infraweavers commented 3 years ago

Looking at that loop, we've added 3 breakpoints (in the same locations as the red dots in VSCode on the right) and run the loop several times; it looks like worker->job_list is always pointing to the same reference:

(screenshot)

infraweavers commented 3 years ago

Again, now looking at job.cc https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L424-L435 it looks like the next item in the job_list is the first item in the job_list, so it will just loop forever.

I've tried to demonstrate this here, as you can see by the printed references: (gif: JobsNextIsItself)

infraweavers commented 3 years ago

Additionally, whilst trying to get a more easily reproducible example, we've found that before gearmand completely stops responding, we end up with a queue with -1 jobs running:

(screenshot)
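
One note on the "-1": if, as the gearadmin output later in this thread suggests, the jobs-running counter is a 32-bit unsigned integer, then decrementing it past zero wraps it to 4294967295 rather than going negative; a tool that reinterprets the value as signed displays -1. A minimal sketch of that assumption:

    #include <cstdint>
    #include <iostream>

    int main()
    {
      uint32_t job_running = 0;
      --job_running;  // unsigned wrap, not a negative number
      std::cout << job_running << "\n";                        // 4294967295
      std::cout << static_cast<int32_t>(job_running) << "\n";  // -1
    }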

infraweavers commented 3 years ago

We managed to get this to occur whilst we had the gearmand logging set to INFO. It seems a little suspicious that there's a unique log line around that time, "Worker timeout reached on job, requeueing", just before gearmand stopped responding. It's perfectly possible that this is a coincidence, however: (screenshot)

esabol commented 3 years ago

Possibly related to issue #119?

p-alik commented 3 years ago

We managed to get this to occur whilst we had the gearmand logging set to INFO. It seems a little suspicious that there's a unique log line around that time, "Worker timeout reached on job, requeueing", just before gearmand stopped responding.

It seems a worker registered with CAN_DO_TIMEOUT couldn't finish its job in time: https://github.com/gearman/gearmand/blob/8f58a8b439ee206eec732c50de982b05e2d40b27/libgearman-server/connection.cc#L655-L657

https://github.com/gearman/gearmand/blob/8f58a8b439ee206eec732c50de982b05e2d40b27/libgearman-server/server.cc#L730-L731

The errors could be reproduced with:

    use strict;
    use warnings;
    use feature 'say';
    use Gearman::Worker;

    my $timeout = 2;
    my $worker  = Gearman::Worker->new(debug => 1);
    $worker->job_servers({ host => "localhost", port => 4730 });
    $worker->register_function(
        'x', $timeout,
        sub { say "going to sleep"; sleep $timeout * 2; }
    );
    $worker->work(
        on_complete => sub { my ($jobhandle, $result) = @_; say "on complete $jobhandle"; },
        on_fail     => sub { my ($jobhandle, $err)    = @_; say "on fail $jobhandle"; },
    ) while 1;

* `gearmand` log

    WARNING 2020-08-17 20:58:01.000000 [ 3 ] Worker timeout reached on job, requeueing: H:varenik:1 917c2972-e0bb-11ea-b55c-7446a09167b1 -> libgearman-server/connection
    WARNING 2020-08-17 20:58:01.000000 [ 3 ] if worker -> libgearman-server/job.cc:344
    WARNING 2020-08-17 20:58:01.000000 [ 3 ] if worker_list -> libgearman-server/job.cc:397
    WARNING 2020-08-17 20:58:01.000000 [ 3 ] return GEARMAND_SUCCESS -> libgearman-server/job.cc:441
    WARNING 2020-08-17 20:58:04.000000 [ proc ] GEARMAN_JOB_NOT_FOUND:Job given in work result not found -> libgearman-server/server.cc:779
       INFO 2020-08-17 20:58:04.000000 [ 3 ] Peer connection has called close()
       INFO 2020-08-17 20:58:04.000000 [ 3 ] Disconnected 127.0.0.1:53120
       INFO 2020-08-17 20:58:04.000000 [ 3 ] Gear connection disconnected: -:-

* status

$ ./bin/gearadmin --status
x       4294967295      0       1
.

esabol commented 3 years ago

I haven’t done much with this part of the code, so I really don’t know what I’m talking about, but this line looks suspicious to me:

https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/worker.cc#L126

It’s passing worker->job_list to gearman_server_job_queue() to requeue. I would think you’d want to create a new job struct to requeue or at least nullify some stuff in the job struct before requeuing the same job struct?

SpamapS commented 3 years ago

We don't need a new job structure. What that code attempts to do, is remove it from the currently assigned worker, and push it back into the unassigned queue. The whole of gearmand works this way, adjusting references.

It feels like we may be zeroing in on it. I think that the bug is exactly there @esabol, but not for the reasons you might be thinking. My theory, which I'm just testing by reading code, and that's kinda hard, is that the timeout screws up the list, leaving it looping around on itself.

I'll try and find some time to debug tonight or tomorrow. We can probably write a regression test around this scenario pretty easily.
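
For anyone wanting to mirror that scenario with the libgearman C API rather than the Perl reproducer above, a rough sketch might look like the following. This is an untested illustration: the assumption that a non-zero timeout passed to gearman_worker_add_function registers the function with CAN_DO_TIMEOUT (matching the Perl case), and the specific timeout/sleep values, would need verifying.

    #include <libgearman/gearman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>

    // Hypothetical worker that registers function "x" with a 2-second
    // timeout, then sleeps past it, mimicking the Perl reproducer above.
    static void *sleepy(gearman_job_st *, void *,
                        size_t *result_size, gearman_return_t *ret_ptr)
    {
      printf("going to sleep\n");
      sleep(4);  // twice the registered timeout
      *result_size = 0;
      *ret_ptr = GEARMAN_SUCCESS;
      return NULL;
    }

    int main()
    {
      gearman_worker_st worker;
      gearman_worker_create(&worker);
      gearman_worker_add_server(&worker, "localhost", 4730);
      // Assumption: a non-zero timeout registers "x" with CAN_DO_TIMEOUT.
      gearman_worker_add_function(&worker, "x", 2, sleepy, NULL);
      while (true)
      {
        gearman_worker_work(&worker);
      }
      return 0;
    }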

infraweavers commented 3 years ago

@p-alik we've been playing with using that client arrangement, and it does seem that "sometimes" it can trigger the event that causes gearmand to lock up; however, we still have the other load on the gearman instance as well, so it's possible that it's just happening randomly and correlates with us using the Perl worker.

p-alik commented 3 years ago

@infraweavers, perl Gearman::Worker dies on the first "GEARMAN_JOB_NOT_FOUND: Job given in work result not found" because it gets an unexpected ERROR in response to GRAB_JOB. The worker implementation in your case seems to proceed to interact with gearmand on the same connection, which leads to the "Job given in work result not found" loop. I guess that's the only difference.

infraweavers commented 3 years ago

@p-alik Yeah, I wonder if that's what triggers the actual problem. We've not yet managed to reproduce the problem by running the Perl worker in a loop against an isolated gearmand, so it feels like it may not be quite as clear-cut as that.

infraweavers commented 3 years ago

We've been watching it a lot today to try and get some handle on the situation in which it happens; we're pretty sure that before it breaks, we've always ended up with a -1 in the Jobs Running column in gearman_top. A few seconds later it seems to end up stuck in that loop again.

p-alik commented 3 years ago

@infraweavers, I hope #302 solves the -1 issue. At least it did in my aforementioned test. After the first worker ran into the timeout, status showed the proper result:

$ ./bin/gearadmin --status
x       1       0       0
.

The next worker handled the job without a timeout and exited afterwards:

$ ./bin/gearadmin --status
x       0       0       0
.

infraweavers commented 3 years ago

@p-alik we weren't able to consistently get the -1 issue using the aforementioned test; however, we have found that if we:

Then gearmand will crash out. However, it's in a completely different way from the original issue raised, as the process vanishes rather than getting stuck in an infinite loop. I've attached a gif of this problem.

(gif attachment)

worker_that_doesnt_timeout.pl.txt worker_that_timesout.pl.txt

We'll apply the fix from https://github.com/gearman/gearmand/pull/302 and see if that resolves it crashing in this situation.

p-alik commented 3 years ago

That's odd: running gearmand (built from master branch 8f58a8b439ee206eec732c50de982b05e2d40b27) doesn't die in my environment after worker_that_doesnt_timeout.pl.txt finishes its work. Status looks like:

$ ./bin/gearadmin --status
x       4294967295      4294967295      0

An attempt to kill gearmand by Ctrl-C doesn't work afterwards:

double free or corruption (out)

Begin stack trace, frames found: 13

For #302, the second worker worker_that_doesnt_timeout.pl.txt dies with "unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]". But without $timeout, worker_that_doesnt_timeout.pl.txt finishes the work faultlessly, status looks good, and gearmand dies on Ctrl-C.

11c11
<     'x', $timeout,
---
>     'x', # $timeout,

infraweavers commented 3 years ago

Interesting. So we've just built https://github.com/gearman/gearmand/commit/8f58a8b439ee206eec732c50de982b05e2d40b27 and the behaviour is unchanged for us (i.e. the same as the gif above).

We're building and running on Debian 9.13 x64, freshly installed to test this whole scenario out, so there's a minimal amount of stuff installed etc.

It's also very odd that we don't have any logs in dmesg or /var/log/syslog, so it doesn't seem to be segfaulting or similar, as that'd be visible in there.

infraweavers commented 3 years ago

Turns out that if we remove all the other functions and workers from gearmand and stop all the other load through it, we can get the same behaviour without that patch applied: gearmand doesn't close, but sits there with a -1:

(gif attachment)

infraweavers commented 3 years ago

We've just completely reinstalled with gearmand 1.1.19 from OMD3.30, and it looks like https://github.com/gearman/gearmand/files/5094915/worker_that_doesnt_timeout.pl.txt causes gearmand to show -1 in the jobs running even though it doesn't time out. So that's ... odd.

SpamapS commented 3 years ago

That -1 is a symptom, not a cause, and actually quite an obvious one: the job_running total is only decremented in gearman_server_job_free or gearman_server_job_queue. In both cases, these always call GEARMAND_LIST_DEL on the linked list that we see in a bad state. So this is a sign that we are probably hitting that code twice before we get into the bad state. Once gearman_server_job_queue has made it through once, it won't hit that code again, because it sets job->worker to NULL, which is further evidence that the problem happens elsewhere.

The next step, I think, is to start looking at everywhere that can mutate worker->job_list and try to reason about whether they will result in the situation. We can also have @infraweavers run a version that has some tracing added that will tell us which call happens right before the problem.

infraweavers commented 3 years ago

We've added some logging and been testing as suggested (https://github.com/infraweavers/gearmand/commit/ffee57dc5fe875f58b7bf3a8ef004080928d6f6d). The first time we got a straightforward reproduction of the -1, our logs looked like this:

   INFO 2020-08-20 16:02:17.000000 [  main ] DEBUG: job.cc - line 265 - gearman_server_job_queue - Numerator: 0, Denominator: 0
 NOTICE 2020-08-20 16:02:17.000000 [  main ] accepted,x,23e46d2e-e2f6-11ea-a9e4-fe2a35974c93,0 -> libgearman-server/server.cc:321
   INFO 2020-08-20 16:02:17.000000 [  main ] Accepted connection from 127.0.0.1:46788
   INFO 2020-08-20 16:02:17.000000 [  main ] Peer connection has called close()
   INFO 2020-08-20 16:02:17.000000 [  main ] Disconnected 127.0.0.1:46788
   INFO 2020-08-20 16:02:17.000000 [  main ] Gear connection disconnected: -:-
   INFO 2020-08-20 16:02:18.000000 [  main ] DEBUG: client.cc - line 94 - GEARMAND_LIST_DEL
   INFO 2020-08-20 16:02:18.000000 [  main ] DEBUG: client.cc - line 100 - GEARMAND_LIST_DEL
   INFO 2020-08-20 16:02:18.000000 [  main ] DEBUG: job.cc - line 321 - GEARMAND_LIST_DEL (list: 174378160 item: 174378160) 0
WARNING 2020-08-20 16:02:18.000000 [  main ] Worker timeout reached on job, requeueing: H:OMD2:1 23e46d2e-e2f6-11ea-a9e4-fe2a35974c93 -> libgearman-server/connection.cc:661
   INFO 2020-08-20 16:02:18.000000 [  main ] DEBUG: connection.cc - line 669 - gearman_server_job_queue
   INFO 2020-08-20 16:02:18.000000 [  main ] DEBUG: job.cc - line 389 - GEARMAND_LIST_DEL (list: 0 item: 174378160) running: 0
   INFO 2020-08-20 16:02:18.000000 [  main ] DEBUG: job.cc - line 321 - GEARMAND_LIST_DEL (list: 174378160 item: 174378160) -1

From this, it looks to us like both https://github.com/infraweavers/gearmand/blob/logging/libgearman-server/job.cc#L390 and https://github.com/infraweavers/gearmand/blob/logging/libgearman-server/job.cc#L323 get called for the same item, and it is therefore removed twice!?
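
To make the suspected double removal concrete, below is a simplified, self-contained model of GEARMAND_LIST_ADD/GEARMAND_LIST_DEL-style intrusive list bookkeeping (the real macros differ; this is an illustration, not gearmand's code). The unlink assumes the item is still on the list, so a stale second unlink silently wraps the unsigned counter, which is exactly the -1/4294967295 seen above, and in a longer list it would also rewire neighbours through the node's stale prev/next pointers (one plausible route to the self-looping node observed earlier):

    #include <cassert>
    #include <cstdint>

    struct job
    {
      job *prev = nullptr;
      job *next = nullptr;
    };

    static void list_add(job *&head, uint32_t &count, job *item)
    {
      item->prev = nullptr;
      item->next = head;
      if (head != nullptr)
      {
        head->prev = item;
      }
      head = item;
      ++count;
    }

    static void list_del(job *&head, uint32_t &count, job *item)
    {
      if (item->prev != nullptr)
        item->prev->next = item->next;
      else
        head = item->next;
      if (item->next != nullptr)
        item->next->prev = item->prev;
      --count;  // unconditional: deleting an already unlinked item
                // silently wraps the unsigned counter
    }

    int main()
    {
      job a;
      job *head = nullptr;
      uint32_t running = 0;

      list_add(head, running, &a);  // running == 1
      list_del(head, running, &a);  // running == 0, a unlinked
      list_del(head, running, &a);  // stale second delete: running wraps
                                    // to 4294967295, displayed as -1
      assert(running == UINT32_MAX);
      return 0;
    }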

infraweavers commented 3 years ago

We added more logging to hopefully see where it's being called from (https://github.com/infraweavers/gearmand/commit/60ba6185d13d745420debb80aef2cfb59e6d1566), and the logs look like:

   INFO 2020-08-20 16:46:46.000000 [  main ] Gear connection disconnected: -:-
   INFO 2020-08-20 16:46:46.000000 [  main ] Accepted connection from 127.0.0.1:55836
   INFO 2020-08-20 16:46:46.000000 [  main ] Accepted connection from 127.0.0.1:55840
WARNING 2020-08-20 16:46:46.000000 [  main ] DEBUG: job->worker 0 -> libgearman-server/job.cc:349
WARNING 2020-08-20 16:46:46.000000 [  main ] DEBUG: job->function->worker_list 1676633216 -> libgearman-server/job.cc:405
WARNING 2020-08-20 16:46:46.000000 [  main ] DEBUG: returning GEARMAND_SUCCESS -> libgearman-server/job.cc:451
   INFO 2020-08-20 16:46:46.000000 [  main ] DEBUG: job.cc - line 265 - gearman_server_job_queue - Numerator: 0, Denominator: 0
 NOTICE 2020-08-20 16:46:46.000000 [  main ] accepted,x,5b2f098c-e2fc-11ea-9243-fe2a35974c93,0 -> libgearman-server/server.cc:321
   INFO 2020-08-20 16:46:47.000000 [  main ] Accepted connection from 127.0.0.1:55844
   INFO 2020-08-20 16:46:47.000000 [  main ] Peer connection has called close()
   INFO 2020-08-20 16:46:47.000000 [  main ] Disconnected 127.0.0.1:55844
   INFO 2020-08-20 16:46:47.000000 [  main ] Gear connection disconnected: -:-
WARNING 2020-08-20 16:46:47.000000 [  main ] DEBUG: gearman_server_job_free(1676640976) -> libgearman-server/server.cc:807
   INFO 2020-08-20 16:46:47.000000 [  main ] DEBUG: client.cc - line 94 - GEARMAND_LIST_DEL
   INFO 2020-08-20 16:46:47.000000 [  main ] DEBUG: client.cc - line 100 - GEARMAND_LIST_DEL
   INFO 2020-08-20 16:46:47.000000 [  main ] DEBUG: job.cc - line 321 - GEARMAND_LIST_DEL (list: 1676640976 item: 1676640976) 0
WARNING 2020-08-20 16:46:47.000000 [  main ] Worker timeout reached on job, requeueing: H:OMD2:1 5b2f098c-e2fc-11ea-9243-fe2a35974c93 -> libgearman-server/connection.cc:661
 NOTICE 2020-08-20 16:46:47.000000 [  main ] DEBUG: connection.cc - line 669 - gearman_server_job_queue -> libgearman-server/connection.cc:666
WARNING 2020-08-20 16:46:47.000000 [  main ] DEBUG: job->worker 1676633216 -> libgearman-server/job.cc:349
   INFO 2020-08-20 16:46:47.000000 [  main ] DEBUG: job.cc - line 389 - GEARMAND_LIST_DEL (list: 0 item: 1676640976) running: 0
WARNING 2020-08-20 16:46:47.000000 [  main ] DEBUG: job->function->worker_list 1676633216 -> libgearman-server/job.cc:405
WARNING 2020-08-20 16:46:47.000000 [  main ] DEBUG: returning GEARMAND_SUCCESS -> libgearman-server/job.cc:451
WARNING 2020-08-20 16:46:47.000000 [  main ] DEBUG: gearman_server_job_free(1676640976) -> libgearman-server/gearmand_con.cc:430
   INFO 2020-08-20 16:46:47.000000 [  main ] DEBUG: job.cc - line 321 - GEARMAND_LIST_DEL (list: 1676640976 item: 1676640976) -1
   INFO 2020-08-20 16:46:47.000000 [  main ] Peer connection has called close()

It looks like the call is https://github.com/gearman/gearmand/blob/master/libgearman-server/gearmand_con.cc#L428, which presumably means that the timed-out job has server_job->ignore_job set to true? Should it be false, or should the job not exist in the queue at all?

infraweavers commented 3 years ago

Another instance, with the logging tidied up a smidge (https://github.com/infraweavers/gearmand/commit/fc09194a0e57d0b17551c6622f51a4bd8c787a79):

WARNING 2020-08-21 16:59:42.000000 [  main ] DEBUG: gearman_server_job_free(188804880) -> libgearman-server/server.cc:807
WARNING 2020-08-21 16:59:42.000000 [  main ] DEBUG: GEARMAND_LIST_DEL -> libgearman-server/client.cc:96
WARNING 2020-08-21 16:59:42.000000 [  main ] DEBUG:  GEARMAND_LIST_DEL -> libgearman-server/client.cc:102
   INFO 2020-08-21 16:59:42.000000 [  main ] DEBUG: job.cc - line 321 - GEARMAND_LIST_DEL (list: 188804880 item: 188804880) 0
WARNING 2020-08-21 16:59:42.000000 [  main ] Worker timeout reached on job, requeueing: H:OMD2:1 53163a38-e3c7-11ea-a225-fe2a35974c93 -> libgearman-server/connection.cc:661
 NOTICE 2020-08-21 16:59:42.000000 [  main ] DEBUG: connection.cc - line 669 - gearman_server_job_queue -> libgearman-server/connection.cc:666
WARNING 2020-08-21 16:59:42.000000 [  main ] DEBUG: job->worker 188801872 -> libgearman-server/job.cc:349
   INFO 2020-08-21 16:59:42.000000 [  main ] DEBUG: job.cc - line 389 - GEARMAND_LIST_DEL (list: 0 item: 188804880) running: 0
WARNING 2020-08-21 16:59:42.000000 [  main ] DEBUG: job->function->worker_list 188801872 -> libgearman-server/job.cc:405
WARNING 2020-08-21 16:59:42.000000 [  main ] DEBUG: returning GEARMAND_SUCCESS -> libgearman-server/job.cc:451
   INFO 2020-08-21 16:59:42.000000 [  main ] Peer connection has called close()
   INFO 2020-08-21 16:59:42.000000 [  main ] Disconnected 127.0.0.1:55548
   INFO 2020-08-21 16:59:42.000000 [  main ] Gear connection disconnected: -:-
WARNING 2020-08-21 16:59:42.000000 [  main ] DEBUG: gearman_server_job_free(188804880) -> libgearman-server/gearmand_con.cc:430
   INFO 2020-08-21 16:59:42.000000 [  main ] DEBUG: job.cc - line 321 - GEARMAND_LIST_DEL (list: 188804880 item: 188804880) -1

So what that looks like to me is that if the worker disconnects after a job for it has been requeued, then we hit the -1 situation, as GEARMAND_LIST_DEL gets called both for the job being requeued and for the connection being closed? I think that's what's going on in this situation, anyway.

SpamapS commented 3 years ago

Agreed. I think we may have the smoke trail; now we need to follow it to the gun. And since you're running with --threads=0, this is not a thread-safety issue.

So I think this must be an event concurrency issue, where libevent has paused one event and started another, and these code paths assume that they only run in the main thread and thus don't need locks.

I'll keep digging in this direction. Thanks for all the trace information!

SpamapS commented 3 years ago

If I'm reading the code right, ignore_job is only set to true in two instances:

  1. When all of the clients attached to a foreground job disconnect. This is done so that any results/fails/etc. are dropped silently when workers complete a foreground job that no longer needs any comms back to clients. Since you've said everything is a background job, that's not happening.

  2. When a job is cancelled via the admin cancel command. Are you using this command? It's actually relatively new, added quite a bit after your old 0.33 release.

Are you using "cancel" at all? Could just be running gearadmin --cancel-job

SpamapS commented 3 years ago

Eliminated one possibility, both from logs and reading the code.

gearman_server_job_queue is called in _server_job_timeout and IF it ever returned anything other than GEARMAND_SUCCESS, we'd double-free.

But the function only has two returns, both of which are GEARMAND_SUCCESS. We may actually want to just make the function a void as a result, to be more clear that it can't return anything else.

Out of time for this for now. Hopefully will look again soon, but I think we're getting closer.

esabol commented 3 years ago

gearman_server_job_queue is called in _server_job_timeout and IF it ever returned anything other than GEARMAND_SUCCESS, we'd double-free.

But the function only has two returns, both of which are GEARMAND_SUCCESS. We may actually want to just make the function a void as a result, to be more clear that it can't return anything else.

Should it always return GEARMAND_SUCCESS though? Why doesn't it return a failed status here?

https://github.com/gearman/gearmand/blob/8f58a8b439ee206eec732c50de982b05e2d40b27/libgearman-server/job.cc#L379-L380

It should only get there if the number of retries exceeds the maximum number of retries allowed.

SpamapS commented 3 years ago

Should it always return GEARMAND_SUCCESS though? Why doesn't it return a failed status here?

Yes, because the server did exactly as it was configured to do at that point. The log is the last gasp as the job is obliterated.

SpamapS commented 3 years ago

Ok I think the logs and reading the code suggest that this happens specifically when two things happen very close together:

1) The worker's connection hits the CAN_DO_TIMEOUT.
2) The worker closes its own connection around the same time, and definitely before the job is picked up by another worker.

The timeout seems to leave something dangling, possibly something that gets shored up later; but before it can be, the worker's connection is freed, and thus the structure is broken and we get stuck from there.

The next thing to do is probably to try and build a reliable simulation of this to test the hypothesis. But I'm really out of time and energy for this, for now.

infraweavers commented 3 years ago

Are you using "cancel" at all? Could just be running gearadmin --cancel-job

@SpamapS No, we're not doing that explicitly, and it doesn't look like the gearman_worker does it.

sni commented 3 years ago

There is no part in Mod-Gearman which calls cancel in any way.

sni commented 3 years ago

Btw, there is also no part which calls anything timeout-related. Or is there some implicit timeout in gearmand? Mod-Gearman uses the C library and the new worker uses the Go library.

infraweavers commented 3 years ago

@sni do you know whether the mod-gearman C worker specifies the CAN_DO_TIMEOUT flag?

sni commented 3 years ago

No, workers are created like this: https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L33 and clients like this: https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L73. Besides that, there should be no way a worker or client can confuse and break the server, whether by misconfiguration or by misbehaviour (intended or not).

infraweavers commented 3 years ago

@sni that is interesting. We started investigating the timeout option because of this log: https://user-images.githubusercontent.com/36475663/90411874-4a0f5a80-e0a4-11ea-818e-26af02eac2a9.png That was an OMD3.30-upgraded box (called OMDA03) running a copy of our usual naemon config. Judging by the name of the job that's being requeued, could it be a result job rather than a check job?

sni commented 3 years ago

I cannot tell that from the name; that's nothing mod-gearman sets. Result queue jobs are usually super fast: it's just a few bytes of text put on an internal linked list. This usually takes less than a millisecond.

infraweavers commented 3 years ago

@sni Hmm, judging by the output from running the worker with trace logging, it looks like it isn't a result queue job but a check job, as they seem to have handles in the format `H:$hostname:$JobId`. We're going to try to reproduce the problem with the worker logging enabled; maybe that'll shed some light on the sequence of events.

infraweavers commented 3 years ago

We've just been adding more logging around the setting of ignore_job, and it appears it is being set in the timeout path (screenshot based on our fork at https://github.com/infraweavers/gearmand/commit/696d8a76680bfd2577d8cbedc4c51a686309c318): (screenshot)

infraweavers commented 3 years ago

Following on from the above, we're a little confused about how the requeuing/timing out should actually work. It sets ignore_job (https://github.com/gearman/gearmand/blob/master/libgearman-server/client.cc#L101), but then requeues the same job; so it seems to assume the job is going to be thrown away, yet it actually isn't. So we changed gearman_server_job_queue to also reset the ignore_job flag, and now the job is counted in jobs waiting, but still shows -1 in jobs running (https://github.com/infraweavers/gearmand/commit/87b9808de8ab78395a6f0891c64b606f9b916208#diff-cfb5bf99d09669ca4e68a8359da7ab11). It feels like something isn't getting cleaned up quite right around here: https://github.com/infraweavers/gearmand/blob/master/libgearman-server/job.cc#L383 ? But we're not really sure what the missing piece could be.

p-alik commented 3 years ago

It feels like something isn't getting cleaned up quite right around here: https://github.com/infraweavers/gearmand/blob/master/libgearman-server/job.cc#L383 ? But not really sure what the missing piece could be.

Does https://github.com/gearman/gearmand/pull/302/commits/902f7bb73afa72b45170d0ac76e67e6e1af58b9c help?

infraweavers commented 3 years ago

@p-alik no, that change doesn't make any difference to the problem; we still end up with -1 running jobs.

The complete PR does prevent the -1 appearing in jobs running when we run the Perl worker above and it undergoes the timeout etc.; however, we don't know for sure that this is actually the same problem we experience when running OMD3.30.