collectiveidea / delayed_job

Database based asynchronous priority queue system -- Extracted from Shopify
http://groups.google.com/group/delayed_job
MIT License

Jobs firing twice #866

Closed renatodex closed 2 months ago

renatodex commented 8 years ago

Hi guys, I have an application using Delayed Job 3.0.5 (it's old), and we are having a strange bug.

Fairly often, the same job is sent simultaneously to more than one dyno of our application (we are hosted on Heroku; it's the same queue, but it fires twice across the dynos).

Here's a screenshot that demonstrates the problem:

[screenshot: logs showing one delayed call producing two job executions across dynos]

In the first column you can see a log entry inserted right before the Delayed Job "delay" call (meaning the request was executed only once).

In the second column you can see the delayed code itself firing two log entries that originated from a single delayed call (meaning the job itself was somehow executed twice).

This is the only point in our whole application where this piece of code is executed.

The problem is: we only delay the method once, but somehow it is sent to both "task_worker.1" and "task_worker.2", at the very same time!

Of course we should update our Delayed Job, but I wanted to confirm whether this is a bug in 3.0.5. This application runs a lot of heavy tasks, and I didn't want to upgrade blindly without properly understanding the problem.

Thanks!

albus522 commented 8 years ago

Either the job is erroring the first time and getting re-run or you have the same identifier set for both workers.

renatodex commented 8 years ago

What do you mean by "same identifier set for both workers"?

albus522 commented 8 years ago

Using --identifier=XXX when starting DJ.
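For example, each worker can be given its own identifier when it is started (a sketch using the standard script/delayed_job wrapper; exact flags may differ by version):

RAILS_ENV=production script/delayed_job --identifier=worker-a start
RAILS_ENV=production script/delayed_job --identifier=worker-b start

If two workers are started with the same identifier, Delayed Job can treat them as the same worker and they may pick up each other's locked jobs.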

LucasCioffi commented 8 years ago

Using delayed_job (4.1.1) I ran into the same issue where a job was fired twice on Heroku. I start the worker with this code in /script/delayed_job:

require File.expand_path(File.join(File.dirname(__FILE__), '..', 'config', 'environment'))
require 'delayed/command'
Delayed::Command.new(ARGV).daemonize

Here is the Procfile:

web: bundle exec puma -C config/puma.rb
worker: bundle exec rake jobs:work

The job didn't cause an error, so according to @albus522's suggestion, I might be starting the workers with the same identifier. Where should I look for that identifier?

After inspecting the log, it looks like there were two HTTP requests to the server which fired off the same job twice, so it doesn't seem to be a DJ error, but I thought I'd check in here, in case I have a problem with an identifier above.

Do I need to make a correction above? Thanks for a great gem!

albus522 commented 8 years ago

How many worker dynos are you running?

LucasCioffi commented 8 years ago

@albus522 It was just one worker dyno.

albus522 commented 8 years ago

@LucasCioffi Also I didn't read your post fully. I missed the 2 identical jobs part. Your configuration should be good.

seashellski commented 8 years ago

We are experiencing this issue in our environment using delayed_job 4.1.1.

Our jobs make GET requests to a very slow web service. Some GET requests can take 4-8 seconds to respond. Some jobs make few requests, others make a lot. Our issue surfaces for the jobs that make a lot of these GET requests.

We are running 4 delayed_job workers, distributed across 2 servers:

SERVER A:
  delayed_job.0
  delayed_job.1

SERVER B:
  delayed_job.0
  delayed_job.1

Note that the worker IDs on each individual server are unique; however, the same IDs are used on both servers.

Jobs are configured to run for 4 hours and are allowed to retry up to 3 times.
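For reference, both of those limits are set on the worker, typically in an initializer (a minimal sketch, assuming a Rails app; the values mirror our configuration):

# config/initializers/delayed_job_config.rb
Delayed::Worker.max_run_time = 4.hours  # after this, a locked job is considered expired
Delayed::Worker.max_attempts = 3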

Just today, a job was started and picked up by 1 of the 4 DJ workers; 13 hours later, 3 of the 4 available DJ workers were executing the same job and the attempt count was still 0.

A couple of questions:

After 13 hours, the DJ record for this job looks like this:

=> #<Delayed::Backend::ActiveRecord::Job id: 205418, priority: 0, attempts: 0, handler: "--- !ruby/struct:ScheduledJob\ntree: !ruby/ob...", last_error: nil, run_at: "2016-04-28 13:08:03", locked_at: "2016-04-28 21:08:14", failed_at: nil, locked_by: "delayed_job.0 host:ip-132-81-3-74 pid:18893", queue: "season", created_at: "2016-04-28 13:07:58", updated_at: "2016-04-28 13:07:58">

albus522 commented 8 years ago

Every worker MUST have a unique id, no matter what server they are running on. The id is used to identify jobs that belong to the worker, so if the worker dies and is restarted, it can immediately start working on the job it had again, without waiting for the job to time out.

seashellski commented 8 years ago

Thanks @albus522 - I'll update our deployment to make sure IDs are universally unique.

Your explanation still leaves me confused about how three of the four workers were executing the same job...

Recall that we only have two worker IDs (0 and 1). If three workers were running the same job, then that would mean that either both workers with ID 1 were running the same job, along with one of the workers with ID 0 or vice versa.

If the DJ framework relies on the ID to know which worker should be executing a job, then what would explain why workers, with different IDs, were executing the same job?

albus522 commented 8 years ago

@seashellski It sounds like you have a job that deadlocks the process and prevents Ruby's timeout from firing (yes, there are ways to do that), leaving the process forever executing the job. Then the next process picks it up and repeats the problem. Rinse and repeat.

seashellski commented 8 years ago

They aren't deadlocking, but they are making calls to very high-latency web services (a 90-second RTT is not uncommon) - hence the background task ;)

Does this make sense as an explanation?

One of these calls is in progress in worker (A) at the same time the job is meant to time out. Worker (B) comes along and thinks the job has timed out, and begins executing the job. The call in worker (A) completes and it continues without knowing that worker (B) has begun processing.
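If I'm reading the gem right, that lines up with how jobs are selected: a locked job whose lock is older than max_run_time counts as available again. Roughly (a paraphrase of the ready_to_run scope in delayed_job_active_record, not the exact code):

Delayed::Job.where(
  '(run_at <= ? AND (locked_at IS NULL OR locked_at < ?) OR locked_by = ?) AND failed_at IS NULL',
  now, now - max_run_time, worker_name
)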

albus522 commented 8 years ago

I believe IO is one of the things that can cause timeouts to behave strangely, so that could be possible.

rossinkiwi commented 8 years ago

I had an issue with recurring tasks running more than once on my server. I tracked it down to a problem with how I was initializing the recurring tasks and how that interacted with server reboots. I was using an initializer to set up the recurring jobs, i.e.:

 # config/initializers/recurring_jobs.rb
 WeeklyTasks.schedule!

I was using unattended-upgrades to automatically load security updates and reboot my server when necessary. When the server rebooted, the initializer was re-run and the delayed-jobs table was loaded with duplicate entries for the WeeklyTasks job, causing the job to run twice.

The solution was to change the way I initialize the recurring jobs. Instead of putting the initializer in config/initializers, I use a rake task that I run as part of my deploy script:

# lib/tasks/schedule_recurring_jobs.rake
# to run: rake schedule_recurring_jobs:jobs
namespace :schedule_recurring_jobs do
  desc "Initialize delayed_job_recurring tasks"
  task jobs: :environment do
    WeeklyTasks.schedule!
  end
end
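An alternative that keeps the initializer is to guard the call so a reboot cannot enqueue a duplicate (a rough sketch; the handler LIKE match is approximate and assumes delayed_job's default YAML serialization of the job):

# config/initializers/recurring_jobs.rb
unless Delayed::Job.where('handler LIKE ?', '%WeeklyTasks%').exists?
  WeeklyTasks.schedule!
end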

hiattp commented 7 years ago

We're also seeing this issue on delayed_job 4.1.1. We have two delayed job processes (identifiers delayed_job.1 and delayed_job.0) on separate EC2 instances. Looking at the logs side by side, you can see the server on the left enqueue the job and then both delayed job processes pick it up:

[screenshot: side-by-side logs showing both delayed job processes picking up the same job]

This happens about 80% of the time, and the jobs generally run about five minutes apart.

albus522 commented 7 years ago

@hiattp I suspect your max run time is set around 5 minutes. After that period expires the job is considered available.

hiattp commented 7 years ago

Interesting, yeah, that sounds like a likely culprit. It seems like it should fail more aggressively, or at least generate a WARN log, if the job becomes fair game for other processes 5 minutes in. I didn't even know about that option.

albus522 commented 7 years ago

Normally the Ruby timeout which wraps all jobs takes care of that. However, timeout is not 100% reliable and there are ways to cause the timeout to not trigger properly. Network IO is one of the things that can sometimes do strange things to Ruby's timeout handling.
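Roughly, the worker wraps each job like this (a paraphrase, not the gem's exact code):

require 'timeout'

# inside the worker's run loop, approximately:
Timeout.timeout(Delayed::Worker.max_run_time.to_i) do
  job.invoke_job
end

The timeout is delivered by raising an exception into the running thread, and long blocking network IO (or calls into C extensions) can delay that raise until the blocking call returns, so a job can keep running well past max_run_time.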

patrickpilch commented 7 years ago

I've also been bitten by this issue due to the application being containerized. The Linux kernel's PID namespacing means processes in each container get their own low PID numbers, and with multiple delayed_job workers there was a good chance that many shared the same PID across containers. All the containers also happened to have the same hostname, which, coupled with the worker name generation logic, meant there were many duplicate worker names.

My quick & dirty solution was a plugin that hooked into the before :execute lifecycle and augmented the worker name with a UUIDv4. It would be nice if you could pass a closure to Delayed::Worker.worker_name in an initializer though...
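Something along these lines (a sketch of the plugin described above; the class name and file path are illustrative, not part of the gem):

# config/initializers/delayed_job_unique_worker_name.rb
require 'securerandom'

class UniqueWorkerNamePlugin < Delayed::Plugin
  callbacks do |lifecycle|
    lifecycle.before(:execute) do |worker|
      # Append a random suffix so workers sharing a hostname and PID
      # still produce distinct locked_by values.
      worker.name = "#{worker.name} #{SecureRandom.uuid}"
    end
  end
end

Delayed::Worker.plugins << UniqueWorkerNamePlugin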

sterankin commented 7 years ago

I have had a similar issue. I am running multiple instances of a Rails application, and the delayed jobs are scheduled by the initializer.

Now, because the jobs are all scheduled on each instance, I assume they are all trying to run the delayed job from the delayed jobs table.

It seems to be a race condition, where the first one to pick up and lock the job wins.

However, I am seeing cases where different application instances pick up the same job at the same time, mere milliseconds apart.

It doesn't happen all the time, but it happens often enough to be an issue. I assume both jobs run because the job doesn't get locked in time, so both instances see it as available.

Has anyone encountered this and is there a recommended solution?

amorimlucas commented 7 years ago

Hey @albus522, I'm getting this issue as well. My Rails app is deployed with 2 workers (inside a Dokku web container), with different PIDs. I wasn't getting this error before, but yesterday an important job (that processes billing) got scheduled twice at the exact same time, and the code was processed twice (even though it is idempotent, which suggests both runs executed at exactly the same moment). So my test orders were processed twice... What could be causing this? (BTW, the job sends out an email with a standard Rails mailer.)

hiattp commented 7 years ago

This is probably uncommon but I've seen this happen in situations with multiple servers where the server clocks aren't accurate/synced. Just something else to check.

amorimlucas commented 7 years ago

My app is running in a Dokku container (single Digital Ocean droplet), but it looks like both the container and nginx server have their times/time zones synced.

RedNoo commented 6 years ago

@amorimlucas did you find any solution? I am using dokku, too. I am seeing the following lines in the console:

[Worker(host:k pid:4962)] Job SimpleJob (id=9305) (queue=Default) RUNNING
[Worker(host:k pid:4962)] Job SimpleJob (id=9305) (queue=Default) COMPLETED after 0.7371
[Worker(host:k pid:4962)] Job ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper (id=9306) (queue=Default) RUNNING
[Worker(host:k pid:4962)] Job ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper (id=9306) (queue=Default) COMPLETED after 0.5055
[Worker(host:k pid:4962)] 2 jobs processed at 1.5728 j/s, 0 failed

amorimlucas commented 6 years ago

@RedNoo My issue turned out to be that I had duplicate containers running in the background. Try running docker ps to check your containers. I also recommend checking out the dokku Slack channel. They helped me out a ton in the chat.

albus522 commented 6 years ago

@RedNoo That log output is not duplicate jobs running. That is job 9305 starting and finishing and then 9306 starting and finishing.

Wylan commented 6 years ago

I believe we've run into this same issue, or at least a similar one, when running multiple delayed job workers in one Kubernetes pod. Even though they are in separate containers, they all have the same hostname and PID due to the way Kubernetes pods work.

Our current workaround is a custom rake task that sets the worker name before starting it. Perhaps something similar could be added to the default set of tasks?

namespace :jobs do

  desc 'Start delayed_job worker with a custom name'
  task :work_with_name, [:name] => :environment_options do |_, args|
    args.with_defaults(name: SecureRandom.hex(5))
    # modified from line 149 in delayed/worker.rb
    worker_name = "#{("host:#{Socket.gethostname} pid:#{Process.pid}" rescue "pid:#{Process.pid}")} #{args[:name]}"[0...255]

    worker      = Delayed::Worker.new(@worker_options)
    worker.name = worker_name
    worker.start
  end

end

Then we invoke this in our containers like so:

rake jobs:work_with_name[worker-0]
rake jobs:work_with_name[worker-1]
etc.

kir commented 5 years ago

Hello, I've faced this problem on 3.0.5 almost immediately after starting several workers instead of one. A possible fix in my case was to change the self.reserve method in Delayed::Backend::ActiveRecord::Job in the delayed_job_active_record gem to:

def self.reserve(worker, max_run_time = Worker.max_run_time)
  # scope to filter to records that are "ready to run"
  ready_scope = self.ready_to_run(worker.name, max_run_time)

  # scope to filter to the single next eligible job
  next_scope = ready_scope.scoped
  next_scope = next_scope.scoped(:conditions => ['priority >= ?', Worker.min_priority]) if Worker.min_priority
  next_scope = next_scope.scoped(:conditions => ['priority <= ?', Worker.max_priority]) if Worker.max_priority
  next_scope = next_scope.scoped(:conditions => ['queue IN (?)', Worker.queues]) if Worker.queues.any?
  next_scope = next_scope.by_priority.limit(1)

  now = self.db_time_now
  job = next_scope.first
  return unless job

  job.with_lock do
    # Check still not locked by a parallel job:
    if ready_scope.where(id: job.id).count == 1
      job.locked_at = now
      job.locked_by = worker.name
      job.save!
    else
      return
    end
  end
  job
end

The changed part is marked with the comment # Check still not locked by a parallel job:; this fixes the race condition, as far as I can tell.
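Another way to close the same race (a sketch of the general idea, not the gem's code) is to make the claim itself atomic: issue a single UPDATE restricted to rows that are still ready to run and treat zero affected rows as a lost race:

# inside self.reserve, replacing the with_lock block above:
now = self.db_time_now
claimed = ready_scope.where(:id => job.id).
            update_all(:locked_at => now, :locked_by => worker.name)
return if claimed == 0 # another worker won the race
job.reload
job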

Would be glad to hear what others think. Best,

albus522 commented 2 months ago

Closing stale