Closed p-alik closed 3 years ago
I was considering this same change, but I was worried it didn’t address the underlying problem. I think it’s worth a try though.
Sometime between last week and today, I seem to have lost the ability to restart gearman/gearmand builds in Travis CI. Anyone know why? Never mind. Signing out of Travis CI and signing back in again fixed this.
This doesn't seem to have completely solved the problem we mentioned in https://github.com/gearman/gearmand/issues/301#issuecomment-675924682 however like we said before, it's possible this a completely different thing from our original issue:
We have had it take a few goes to crash out with the worker than doesn't timeout for some reason.
Additionally, now we've completely isolated the gearmand, it doesn't crash when testing that; however if you run worker_that_doesnt_timeout.pl
twice, it will still end up with -1 jobs running even with this change in, although now we end up with 1 in Jobs Waiting as well:
Well, you could add similar guards wherever job_running is decremented. There are only two locations:
https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L301 https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L385
But I think gearmand will still hang. Could you try that, @infraweavers ?
In order to get to -1, job_running had to have been decremented more than once, obviously. Is gearman_server_job_free( )
being called on the same job twice? I’m also kind of wondering what happens if you comment out
https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L385
entirely.
Well, you could add similar guards wherever job_running is decremented.
I did it yesterday, but the result wasn't much better:-(
There are only two locations:
a1873eacd2527aeff88ea055164704f0b973175f
968e2836572a87a8002483a8d4991741096fe765
THis PR does at least seem to stop the jobs_running
column from going negative in mini-test
Going to close this, but feel free to ping / re-open if you can add a regression test. I am still thinking there's something deeper and this is just treating the symptom.
Yeah sounds good
The PR aims to fix #301. My tests are based on https://github.com/gearman/gearmand/issues/301#issuecomment-675059023