I've noticed an issue in production where a job is running on two different workers at the same time. This happens after the job ran over the max_run_time value.
I believe I've tracked this down to a race condition between the dead worker detection and the max_run_time timeout.
Lets say you have worker A and worker B. These steps could happen in order
Worker A is running job 1 that runs over max_run_time, the Timeout.timeout call raises an error.
Worker B runs its SQL to get the next job and grabs job 1 since max_run_time has passed. This sets the locked_at and locked_by values
Worker A finishes the error handling and clears locked_at and locked_by values on job 1.
After the retry window, Worker A (or some other worker) starts running job 1 since the locked_at is now NULL.
How to reproduce
To reproduce this, I set the after method on my test job to sleep, ensuring worker B
has enough time to run its next job query before the job cleans up. The race condition
doesn't require the sleep, but the sleep ensures we always lose the race.
In my rails initializers:
# Make this short so the race condition is easy to reproduce
Delayed::Worker.max_run_time = 30
Test job:
class TestJob
def after _job
log 'after - start'
# Sleep for some time that's longer than the standard 5 second poll window.
# This doesn't have to be larger than 5 seconds to trigger, but having a larger value
# guarantees we loose the race condition every time.
sleep 6
log 'after - end'
end
def log msg
puts "#{Time.current}: #{msg}"
end
def perform
log 'perform - start'
# Sleep for longer than Delayed::Worker.max_run_time
sleep 60
log 'perform - stop'
end
end
I started two workers with rake jobs:work, then ran the following code in a rails console
job = Delayed::Job.enqueue(TestJob.new)
loop do
sleep 1
job.reload
puts "#{Time.current}: Locked by #{job.locked_by} at #{job.locked_at}"
end
Results
Here are output snippets from running the above code.
Worker A:
[Worker(host:bb31a17bbfca pid:614)] Starting job worker
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) RUNNING
2021-07-07 08:55:47 -0400: perform - start
2021-07-07 08:56:17 -0400: after - start
2021-07-07 08:56:23 -0400: after - end
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) FAILED (0 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 30 seconds)
[Worker(host:bb31a17bbfca pid:614)] 1 jobs processed at 0.0277 j/s, 1 failed
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) RUNNING
2021-07-07 08:56:33 -0400: perform - start
2021-07-07 08:57:03 -0400: after - start
2021-07-07 08:57:09 -0400: after - end
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) FAILED (1 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 30 seconds)
Worker B:
[Worker(host:bb31a17bbfca pid:629)] Starting job worker
[Worker(host:bb31a17bbfca pid:629)] Job TestJob (id=45) RUNNING
2021-07-07 08:56:19 -0400: perform - start
2021-07-07 08:56:49 -0400: after - start
2021-07-07 08:56:55 -0400: after - end
[Worker(host:bb31a17bbfca pid:629)] Job TestJob (id=45) FAILED (0 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 30 seconds)
[Worker(host:bb31a17bbfca pid:629)] 1 jobs processed at 0.0277 j/s, 1 failed
From this output, we can see that Worker A ran the job first. While Worker A was in
the after method, Worker B started the job. Worker A then started the job later while
Worker B was still running it.
Here's a snippet from the rails console showing the state changes:
(I removed the SQL query logs to make it easier to read and added a few comments)
2021-07-07 08:56:15 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:16 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:17 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:18 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:19 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
# Worker B grabs the job and resets locked_at and locked_by
2021-07-07 08:56:20 -0400: Locked by host:bb31a17bbfca pid:629 at 2021-07-07 08:56:19 -0400
2021-07-07 08:56:21 -0400: Locked by host:bb31a17bbfca pid:629 at 2021-07-07 08:56:19 -0400
2021-07-07 08:56:22 -0400: Locked by host:bb31a17bbfca pid:629 at 2021-07-07 08:56:19 -0400
# Worker A finishes error handling and clears locked_at and locked_by
2021-07-07 08:56:23 -0400: Locked by at
2021-07-07 08:56:24 -0400: Locked by at
2021-07-07 08:56:25 -0400: Locked by at
2021-07-07 08:56:26 -0400: Locked by at
2021-07-07 08:56:27 -0400: Locked by at
2021-07-07 08:56:28 -0400: Locked by at
2021-07-07 08:56:29 -0400: Locked by at
2021-07-07 08:56:30 -0400: Locked by at
2021-07-07 08:56:31 -0400: Locked by at
2021-07-07 08:56:32 -0400: Locked by at
# Worker A starts running the job again
2021-07-07 08:56:33 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:34 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:35 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:36 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:37 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
I've noticed an issue in production where a job is running on two different workers at the same time. This happens after the job ran over the
max_run_time
value.I believe I've tracked this down to a race condition between the dead worker detection and the
max_run_time
timeout.Relevant versions
delayed_job: 4.1.9 delayed_job_active_record: 4.1.6 rails / activerecord: 6.0.4 MySQL: 8.0.20
What I think is happening
Lets say you have worker A and worker B. These steps could happen in order
max_run_time
, theTimeout.timeout
call raises an error.max_run_time
has passed. This sets thelocked_at
andlocked_by
valueslocked_at
andlocked_by
values on job 1.locked_at
is nowNULL
.How to reproduce
To reproduce this, I set the
after
method on my test job to sleep, ensuring worker B has enough time to run its next job query before the job cleans up. The race condition doesn't require the sleep, but the sleep ensures we always lose the race.In my rails initializers:
Test job:
I started two workers with
rake jobs:work
, then ran the following code in a rails consoleResults
Here are output snippets from running the above code.
Worker A:
Worker B:
From this output, we can see that Worker A ran the job first. While Worker A was in the
after
method, Worker B started the job. Worker A then started the job later while Worker B was still running it.Here's a snippet from the rails console showing the state changes: (I removed the SQL query logs to make it easier to read and added a few comments)