collectiveidea / delayed_job_active_record

ActiveRecord backend integration for DelayedJob 3.0+
MIT License
346 stars 336 forks source link

Job running on 2 workers at once after timeout #193

Open seandilda opened 3 years ago

seandilda commented 3 years ago

I've noticed an issue in production where a job is running on two different workers at the same time. This happens after the job ran over the max_run_time value.

I believe I've tracked this down to a race condition between the dead worker detection and the max_run_time timeout.

Relevant versions

delayed_job: 4.1.9 delayed_job_active_record: 4.1.6 rails / activerecord: 6.0.4 MySQL: 8.0.20

What I think is happening

Lets say you have worker A and worker B. These steps could happen in order

  1. Worker A is running job 1 that runs over max_run_time, the Timeout.timeout call raises an error.
  2. Worker B runs its SQL to get the next job and grabs job 1 since max_run_time has passed. This sets the locked_at and locked_by values
  3. Worker A finishes the error handling and clears locked_at and locked_by values on job 1.
  4. After the retry window, Worker A (or some other worker) starts running job 1 since the locked_at is now NULL.

How to reproduce

To reproduce this, I set the after method on my test job to sleep, ensuring worker B has enough time to run its next job query before the job cleans up. The race condition doesn't require the sleep, but the sleep ensures we always lose the race.

In my rails initializers:

# Make this short so the race condition is easy to reproduce
Delayed::Worker.max_run_time = 30

Test job:

class TestJob
  def after _job
    log 'after - start'
    # Sleep for some time that's longer than the standard 5 second poll window.
    # This doesn't have to be larger than 5 seconds to trigger, but having a larger value
    # guarantees we loose the race condition every time.
    sleep 6
    log 'after - end'
  end

  def log msg
    puts "#{Time.current}: #{msg}"
  end

  def perform
    log 'perform - start'
    # Sleep for longer than Delayed::Worker.max_run_time
    sleep 60
    log 'perform - stop'
  end
end

I started two workers with rake jobs:work, then ran the following code in a rails console

job = Delayed::Job.enqueue(TestJob.new)

loop do
  sleep 1
  job.reload
  puts "#{Time.current}: Locked by #{job.locked_by} at #{job.locked_at}"
end

Results

Here are output snippets from running the above code.

Worker A:

[Worker(host:bb31a17bbfca pid:614)] Starting job worker
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) RUNNING
2021-07-07 08:55:47 -0400: perform - start
2021-07-07 08:56:17 -0400: after - start
2021-07-07 08:56:23 -0400: after - end
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) FAILED (0 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 30 seconds)
[Worker(host:bb31a17bbfca pid:614)] 1 jobs processed at 0.0277 j/s, 1 failed
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) RUNNING
2021-07-07 08:56:33 -0400: perform - start
2021-07-07 08:57:03 -0400: after - start
2021-07-07 08:57:09 -0400: after - end
[Worker(host:bb31a17bbfca pid:614)] Job TestJob (id=45) FAILED (1 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 30 seconds)

Worker B:

[Worker(host:bb31a17bbfca pid:629)] Starting job worker
[Worker(host:bb31a17bbfca pid:629)] Job TestJob (id=45) RUNNING
2021-07-07 08:56:19 -0400: perform - start
2021-07-07 08:56:49 -0400: after - start
2021-07-07 08:56:55 -0400: after - end
[Worker(host:bb31a17bbfca pid:629)] Job TestJob (id=45) FAILED (0 prior attempts) with Delayed::WorkerTimeout: execution expired (Delayed::Worker.max_run_time is only 30 seconds)
[Worker(host:bb31a17bbfca pid:629)] 1 jobs processed at 0.0277 j/s, 1 failed

From this output, we can see that Worker A ran the job first. While Worker A was in the after method, Worker B started the job. Worker A then started the job later while Worker B was still running it.

Here's a snippet from the rails console showing the state changes: (I removed the SQL query logs to make it easier to read and added a few comments)

2021-07-07 08:56:15 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:16 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:17 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:18 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400
2021-07-07 08:56:19 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:55:47 -0400

# Worker B grabs the job and resets locked_at and locked_by

2021-07-07 08:56:20 -0400: Locked by host:bb31a17bbfca pid:629 at 2021-07-07 08:56:19 -0400
2021-07-07 08:56:21 -0400: Locked by host:bb31a17bbfca pid:629 at 2021-07-07 08:56:19 -0400
2021-07-07 08:56:22 -0400: Locked by host:bb31a17bbfca pid:629 at 2021-07-07 08:56:19 -0400

# Worker A finishes error handling and clears locked_at and locked_by

2021-07-07 08:56:23 -0400: Locked by  at
2021-07-07 08:56:24 -0400: Locked by  at
2021-07-07 08:56:25 -0400: Locked by  at
2021-07-07 08:56:26 -0400: Locked by  at
2021-07-07 08:56:27 -0400: Locked by  at
2021-07-07 08:56:28 -0400: Locked by  at
2021-07-07 08:56:29 -0400: Locked by  at
2021-07-07 08:56:30 -0400: Locked by  at
2021-07-07 08:56:31 -0400: Locked by  at
2021-07-07 08:56:32 -0400: Locked by  at

# Worker A starts running the job again

2021-07-07 08:56:33 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:34 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:35 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:36 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
2021-07-07 08:56:37 -0400: Locked by host:bb31a17bbfca pid:614 at 2021-07-07 08:56:33 -0400
eric-hemasystems commented 1 year ago

I've run into a similar issue. I also see #136 is reporting a similar issue with a bit of a work-around so wanted to note that here.