gearman / gearmand

http://gearman.org/
Other
740 stars 137 forks source link

decrement server_job->function->job_total only if it > 0 #302

Closed p-alik closed 3 years ago

p-alik commented 4 years ago

The PR aims to fix #301. My tests are based on https://github.com/gearman/gearmand/issues/301#issuecomment-675059023

esabol commented 4 years ago

I was considering this same change, but I was worried it didn’t address the underlying problem. I think it’s worth a try though.

Sometime between last week and today, I seem to have lost the ability to restart gearman/gearmand builds in Travis CI. Anyone know why? Never mind. Signing out of Travis CI and signing back in again fixed this.

infraweavers commented 4 years ago

This doesn't seem to have completely solved the problem we mentioned in https://github.com/gearman/gearmand/issues/301#issuecomment-675924682 however like we said before, it's possible this a completely different thing from our original issue:

W0wsQHOriJ

We have had it take a few goes to crash out with the worker than doesn't timeout for some reason.

infraweavers commented 4 years ago

Additionally, now we've completely isolated the gearmand, it doesn't crash when testing that; however if you run worker_that_doesnt_timeout.pl twice, it will still end up with -1 jobs running even with this change in, although now we end up with 1 in Jobs Waiting as well: image

esabol commented 4 years ago

Well, you could add similar guards wherever job_running is decremented. There are only two locations:

https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L301 https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L385

But I think gearmand will still hang. Could you try that, @infraweavers ?

In order to get to -1, job_running had to have been decremented more than once, obviously. Is gearman_server_job_free( ) being called on the same job twice? I’m also kind of wondering what happens if you comment out https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L385 entirely.

p-alik commented 4 years ago

Well, you could add similar guards wherever job_running is decremented.

I did it yesterday, but the result wasn't much better:-(

There are only two locations:

https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L301

a1873eacd2527aeff88ea055164704f0b973175f

https://github.com/gearman/gearmand/blob/e2d76cf631414964decc2f22006a666e1c0e2424/libgearman-server/job.cc#L385

968e2836572a87a8002483a8d4991741096fe765

infraweavers commented 4 years ago

THis PR does at least seem to stop the jobs_running column from going negative in mini-test

SpamapS commented 3 years ago

Going to close this, but feel free to ping / re-open if you can add a regression test. I am still thinking there's something deeper and this is just treating the symptom.

infraweavers commented 3 years ago

Yeah sounds good