collectiveidea / delayed_job

Database based asynchronous priority queue system -- Extracted from Shopify
http://groups.google.com/group/delayed_job
MIT License

Timeout doesn't halt job when max_run_time is reached #1180

Closed by diogoosorio 9 months ago

diogoosorio commented 1 year ago

We observed this behaviour in our environment, where the same job was being picked up by more than one worker. After some digging, the culprit appeared to be timed-out jobs that did not actually halt once the timeout period had elapsed.

Looking at the source code, we think we traced the cause to a quirk of Ruby's Timeout library. Because a specific exception class is passed to the Timeout module, the timeout exception can be rescued inside the block, i.e. inside the job itself. Depending on what the thread running the job is doing when the exception is raised (for example, if it is inside code wrapped in a rescue StandardError), the exception is swallowed and the job may not halt.
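The quirk can be reproduced in isolation with a few lines of Ruby. This is only a minimal sketch: `WorkerTimeout` here is a hypothetical stand-in for a job-specific timeout class descending from `Timeout::Error`, and `run_job` is a made-up job body, not delayed_job's code.

```ruby
require "timeout"

# Hypothetical stand-in for a job-specific timeout class. Like Timeout::Error,
# it descends from StandardError, so a broad rescue will catch it.
class WorkerTimeout < Timeout::Error; end

# Hypothetical job body that wraps its work in a broad rescue.
def run_job(log)
  begin
    sleep 0.5                 # work that outlives the 0.1s timeout
  rescue StandardError => e
    log << e.class            # the timeout exception is swallowed here
  end
  log << :kept_running        # reached even though the timeout already fired
end

log = []
# Passing an explicit class means that class is raised inside the block.
Timeout.timeout(0.1, WorkerTimeout) { run_job(log) }

p log # => [WorkerTimeout, :kept_running]
```

The call to `Timeout.timeout` returns normally, so from the caller's point of view nothing timed out, while the job simply carried on past the point where it was interrupted.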

Once max_run_time has elapsed, another worker is entitled to pick up that same job, so multiple workers can end up executing the same job at the same time.
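To illustrate why an expired lock leads to double execution, here is a deliberately simplified model of lock expiry. The `Job` struct, `MAX_RUN_TIME` constant and `available?` helper are all hypothetical names for this sketch; they do not reproduce delayed_job's actual schema or query.

```ruby
# Hypothetical, simplified model of a queued job's lock state.
Job = Struct.new(:locked_at, :locked_by)

MAX_RUN_TIME = 60 # seconds; assumed value for illustration

# A job is considered free when it is unlocked or when its lock is older
# than MAX_RUN_TIME, regardless of whether the first worker really stopped.
def available?(job, now)
  job.locked_at.nil? || job.locked_at < now - MAX_RUN_TIME
end

start = Time.at(1_000)
job = Job.new(start, "worker-1")   # worker-1 locks the job at `start`

within_limit = available?(job, start + 30) # lock still fresh
after_limit  = available?(job, start + 61) # lock expired, worker-2 may take it
```

If worker-1 swallowed its timeout and is still running at `start + 61`, a second worker that sees the job as available will run it concurrently.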

I've drafted a quick POC which can be found here - https://github.com/diogoosorio/delayed-job-timeout-example

I believe some of these issues can probably be traced back to this problem.

cc @magicknot

diogoosorio commented 9 months ago

The issue is there, but the proposed solution isn't great and the recommendation is to fix this at a different level (see #1181). Closing the issue.