We observed this behaviour in our environment, where the same job was being picked up by more than one worker. After some digging, the culprit seemed to be timed-out jobs that weren't actually halting after the timeout period.
Looking at the source code, we think we traced the cause to this Timeout library quirk. By providing a specific timeout class to the `Timeout` module, we allow the exception to be rescued inside the block / job. Depending on the execution state of the thread running the job (i.e. if the thread is inside a block wrapped in a `rescue StandardError` when the exception is raised), the job may not halt.
Once the `max_run_time` value is reached, another worker is entitled to pick up that same job, which can ultimately lead to multiple workers executing the same job at the same time.
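To illustrate the quirk in isolation, here is a minimal sketch that uses only Ruby's standard `Timeout` module. `ExampleWorkerTimeout` and `run_job` are hypothetical names made up for this example, not the library's actual code. The sketch assumes that the custom timeout class descends from `StandardError`, which is what makes it rescuable inside the job.

```ruby
require "timeout"

# Hypothetical stand-in for a custom timeout error class (an assumption for
# this sketch). Timeout::Error descends from StandardError.
class ExampleWorkerTimeout < Timeout::Error; end

# Runs a fake "job" under a 0.1s timeout and records what happened.
# Passing nil as timeout_class behaves like calling Timeout.timeout(0.1).
def run_job(timeout_class)
  events = []
  begin
    Timeout.timeout(0.1, timeout_class) do
      begin
        sleep 0.5 # the timeout fires while we are inside this begin/rescue
      rescue StandardError => e
        # A broad rescue inside the job can swallow the custom timeout class.
        events << "rescued inside job: #{e.class}"
      end
      # Only reached if the timeout exception was swallowed above.
      events << "job kept running"
    end
  rescue Timeout::Error => e
    # Only reached if the timeout actually interrupted the block.
    events << "timeout reached caller: #{e.class}"
  end
  events
end

p run_job(ExampleWorkerTimeout)
p run_job(nil)
```

With the custom class, we would expect the job to rescue the exception and keep running past its deadline. Without it, the timeout should propagate to the caller as `Timeout::Error` and the remainder of the block should be skipped.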
I've drafted a quick POC, which can be found here: https://github.com/diogoosorio/delayed-job-timeout-example
I believe some of these issues can probably be traced back to this problem.
/cc @magicknot