Timeout doesn't halt job when max_run_time is reached

We observed this behaviour in our environment, where the same job was being picked up by more than one worker. After some digging, it seemed that the culprits were timed-out jobs that weren't actually halting after the timeout period.

Looking at the source code, we think we traced the cause to this Timeout library quirk. By passing a specific exception class to the Timeout module, we allow the timeout exception to be rescued inside the block (i.e. inside the job). Depending on what the thread running the job is executing when the exception is raised (for example, if it is inside a block wrapped in a rescue StandardError), the exception may be swallowed and the job may not halt.
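A minimal sketch of the quirk, independent of delayed_job itself. `ExampleWorkerTimeout`, `run_job`, and `run_with_timeout` are hypothetical names made up for illustration; the only assumption carried over from the description above is that the timeout class descends from `Timeout::Error` (and therefore from `StandardError`).

```ruby
require "timeout"

# Hypothetical stand-in for a worker timeout class. Like Timeout::Error,
# it is a StandardError, so a broad rescue inside the job can catch it.
class ExampleWorkerTimeout < Timeout::Error; end

# Hypothetical job body that wraps its work in a broad rescue.
def run_job(log)
  sleep 0.5 # work that outlasts the timeout
  log << "work completed"
rescue StandardError => e
  # The timeout exception lands here and is swallowed.
  log << "job rescued #{e.class}"
end

# Runs the job under a 0.1s timeout, optionally passing an exception class.
def run_with_timeout(log, klass)
  Timeout.timeout(0.1, klass) { run_job(log) }
  log << "caller saw no timeout"
rescue Timeout::Error => e
  log << "caller saw #{e.class}"
end

with_class = []
run_with_timeout(with_class, ExampleWorkerTimeout)
p with_class
# => ["job rescued ExampleWorkerTimeout", "caller saw no timeout"]

without_class = []
run_with_timeout(without_class, nil)
p without_class
# => ["caller saw Timeout::Error"]
```

When an exception class is passed, the job's own rescue intercepts the timeout and the caller never learns that the limit was exceeded. When no class is passed, the interruption is not rescuable by `rescue StandardError` inside the block, and the caller receives `Timeout::Error`.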

Once the max_run_time value has been reached, another worker is entitled to pick up that same job, which can ultimately lead to multiple workers executing the same job at the same time.

I've drafted a quick proof of concept, which can be found here: https://github.com/diogoosorio/delayed-job-timeout-example

I believe some of these issues can probably be traced back to this problem.

/cc @magicknot

collectiveidea / delayed_job

Timeout doesn't halt job when max_run_time is reached #1180