Either the job is erroring the first time and getting re-run or you have the same identifier set for both workers.
What do you mean by "same identifier set for both workers"?
Using --identifier=XXX when starting DJ.
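For example (identifier values are illustrative), each worker process gets its own value:

RAILS_ENV=production script/delayed_job --identifier=worker-1 start
RAILS_ENV=production script/delayed_job --identifier=worker-2 start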
Using delayed_job (4.1.1)
I ran into the same issue where a job was fired twice on Heroku. I start the worker with this code in /script/delayed_job:
require File.expand_path(File.join(File.dirname(__FILE__), '..', 'config', 'environment'))
require 'delayed/command'
Delayed::Command.new(ARGV).daemonize
Here is the Procfile:
web: bundle exec puma -C config/puma.rb
worker: bundle exec rake jobs:work
The job didn't cause an error, so according to @albus522's suggestion, I might be starting the workers with the same identifier. Where should I look for that identifier?
After inspecting the log, it looks like there were two HTTP requests to the server, each firing off the same job, so it doesn't seem to be a DJ error, but I thought I'd check in here in case I have a problem with an identifier above.
Do I need to make a correction above? Thanks for a great gem!
How many worker dynos are you running?
@albus522 It was just one worker dyno.
@LucasCioffi Also I didn't read your post fully. I missed the 2 identical jobs part. Your configuration should be good.
We are experiencing this issue in our environment using delayed_job 4.1.1.
Our jobs make GET requests to a very slow web service. Some GET requests can take 4-8 seconds to respond. Some jobs make few requests, others make a lot. Our issue surfaces for the jobs that make a lot of these GET requests.
We are running 4 delayed_job workers, distributed across 2 servers:
SERVER A:
delayed_job.0
delayed_job.1
SERVER B:
delayed_job.0
delayed_job.1
Note that the worker IDs on each individual server are unique; however, the same IDs are used on both servers.
Jobs are configured to run for 4 hours and are allowed to retry up to 3 times.
Just today, a job was started and picked up by 1 of the 4 DJ workers; 13 hours later, 3 of the 4 available DJ workers were executing the same job and the attempt count was still 0.
A couple of questions:
After 13 hours, the DJ record for this job looks like this:
=> #<Delayed::Backend::ActiveRecord::Job id: 205418, priority: 0, attempts: 0, handler: "--- !ruby/struct:ScheduledJob\ntree: !ruby/ob...", last_error: nil, run_at: "2016-04-28 13:08:03", locked_at: "2016-04-28 21:08:14", failed_at: nil, locked_by: "delayed_job.0 host:ip-132-81-3-74 pid:18893", queue: "season", created_at: "2016-04-28 13:07:58", updated_at: "2016-04-28 13:07:58">
Every worker MUST have a unique id, no matter what server they are running on. The id is used to identify jobs that belong to the worker. So if the worker dies and is restarted, it can immediately start working on the job it had again, without waiting for the job to timeout.
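For reference, the pickup query enforces this. Paraphrased (structure approximate, not the exact gem source) from delayed_job_active_record:

# A job is claimable when it is unlocked, when its lock is older than
# max_run_time, or when it is locked by this same worker name, which is
# what lets a restarted worker reclaim its own job immediately.
def self.ready_to_run(worker_name, max_run_time)
  where(
    "(run_at <= ? AND (locked_at IS NULL OR locked_at < ?) OR locked_by = ?) AND failed_at IS NULL",
    db_time_now, db_time_now - max_run_time, worker_name
  )
end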
Thanks @albus522 - I'll update our deployment to make sure IDs are universally unique.
Your explanation still has me confused about how three of the four workers were executing the same job...
Recall that we only have two worker IDs (0 and 1). If three workers were running the same job, then that would mean that either both workers with ID 1 were running the same job, along with one of the workers with ID 0 or vice versa.
If the DJ framework relies on the ID to know which worker should be executing a job, then what would explain why workers, with different IDs, were executing the same job?
@seashellski It sounds like you have a job that deadlocks the process and prevents Ruby's timeout from firing (yes, there are ways to do that), leaving the process forever executing the job; then the next process picks it up and repeats the problem. Rinse and repeat.
They aren't deadlocking, but they are making calls to very high-latency web services (a 90 second RTT is not uncommon) - hence the background task ;)
Does this make sense as an explanation?
One of these calls is in progress in worker (A) at the same time the job is due to time out. Worker (B) comes along, thinks the job has timed out, and begins executing it. The call in worker (A) then completes, and (A) continues without knowing that worker (B) has begun processing.
I believe IO is one of the things that can cause timeouts to behave weirdly, so that could be possible.
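A minimal sketch of that failure mode (slow_request is a hypothetical stand-in for one of those GET calls):

require 'timeout'

# A plain sleep is interruptible, so this example times out cleanly;
# the point is that a call blocking inside a C extension, without
# yielding back to Ruby, may not see the Timeout exception until it
# returns on its own.
def slow_request
  sleep 60
end

begin
  Timeout.timeout(5) { slow_request }
rescue Timeout::Error
  puts 'timed out' # with uninterruptible IO this can fire late, or never
end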
I had an issue with recurring tasks running more than once on my server. I tracked it down to a problem with how I was initializing the recurring tasks and its interaction with server reboots. I was using an initializer to set up the recurring jobs, i.e.:
# config/initializers/recurring_jobs.rb
WeeklyTasks.schedule!
I was using unattended-upgrades to automatically install security updates and reboot my server when necessary. When the server rebooted, the initializer was re-run and the delayed_jobs table was loaded with duplicate entries for the WeeklyTasks job, causing the job to run twice.
The solution was to change the way I initialize the recurring jobs. Instead of putting the initializer in config/initializers, I instead use a rake task that I run as part of my deploy script:
# lib/tasks/schedule_recurring_jobs.rake
# to run: rake schedule_recurring_jobs:jobs
namespace :schedule_recurring_jobs do
  desc "Initialize delayed_job_recurring tasks"
  task jobs: :environment do
    WeeklyTasks.schedule!
  end
end
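An alternative sketch, if you'd rather keep the initializer (this assumes the ActiveRecord backend, and that the serialized handler contains the class name), is to make it idempotent by checking for an existing job first:

# config/initializers/recurring_jobs.rb
# Only schedule when no WeeklyTasks job is already enqueued; the LIKE
# match against the serialized handler is an assumption, not verified code.
unless Delayed::Job.where('handler LIKE ?', '%WeeklyTasks%').exists?
  WeeklyTasks.schedule!
end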
We're also seeing this issue on delayed_job 4.1.1. We have two delayed job processes (identifiers delayed_job.1 and delayed_job.0) on separate EC2 instances. Looking at the logs side by side, you can see the server on the left enqueue the job and then both delayed job processes pick it up.
This happens about 80% of the time, and the jobs generally run about five minutes apart.
@hiattp I suspect your max run time is set around 5 minutes. After that period expires, the job is considered available.
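If so, the relevant knob is Delayed::Worker.max_run_time (the gem's default is 4.hours); a sketch, with an illustrative value:

# config/initializers/delayed_job.rb
# Raise the window so a still-running job is not considered available
# to other workers before it finishes.
Delayed::Worker.max_run_time = 20.minutes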
Interesting, yeah, that sounds like a likely culprit. Seems like it should fail more aggressively, or at least generate a WARN log, if it makes the job fair game for other processes 5 min into it. Didn't even know about that option.
Normally the Ruby timeout which wraps all jobs takes care of that. However, timeout is not 100% reliable and there are ways to cause the timeout to not trigger properly. Network IO is one of the things that can sometimes do strange things to Ruby's timeout handling.
I've also been bitten by this issue due to the application being containerized. The Linux kernel's PID namespacing means the processes in each container start counting from PID 1, and with multiple delayed_job workers there was a good chance that many shared the same PID across containers. All the containers also happened to have the same hostname, which, coupled with the worker name generation logic, meant there were many duplicate worker names.
My quick & dirty solution was a plugin that hooked into the before :execute lifecycle event and augmented the worker name with a UUIDv4. It would be nice if you could pass a closure to Delayed::Worker.worker_name in an initializer though...
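A sketch of that kind of plugin (names here are assumptions, not the actual code):

require 'securerandom'

# Appends a UUIDv4 to the worker name before the run loop starts, so
# workers that share a hostname and PID still get unique names.
class UniqueWorkerNamePlugin < Delayed::Plugin
  callbacks do |lifecycle|
    lifecycle.before(:execute) do |worker|
      worker.name = "#{worker.name} uuid:#{SecureRandom.uuid}"
    end
  end
end

Delayed::Worker.plugins << UniqueWorkerNamePlugin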
I have had a similar issue. I am running multiple instances of a Rails application, and the delayed jobs are scheduled by the initializer.
Now, because the jobs are all scheduled on each instance, I assume they are all trying to run the delayed job from the delayed jobs table.
It seems to be a race condition, where the first one to pick up and lock the jobs wins.
However, I am seeing cases where different application instances pick up the same job at the same time, mere milliseconds apart.
It doesn't happen all the time, but it happens often enough to be an issue. I assume both instances run the job because it doesn't get locked in time, so both see it as available.
Has anyone encountered this and is there a recommended solution?
Hey @albus522, I'm getting this issue as well. My Rails app is deployed with 2 workers (inside a Dokku web container) with different PIDs. I wasn't getting this error before, but yesterday an important job (that processes billing) got scheduled twice at the exact same time, and the code ran twice (even though the job is idempotent, which means they must have been executed at the exact same time). So my test orders were processed twice... What could be causing this? (BTW, the job sends out an email with a standard Rails mailer.)
This is probably uncommon but I've seen this happen in situations with multiple servers where the server clocks aren't accurate/synced. Just something else to check.
My app is running in a Dokku container (single Digital Ocean droplet), but it looks like both the container and nginx server have their times/time zones synced.
@amorimlucas did you find any solution? I am using dokku, too. I am seeing the following lines in the console:
[Worker(host:k pid:4962)] Job SimpleJob (id=9305) (queue=Default) RUNNING
[Worker(host:k pid:4962)] Job SimpleJob (id=9305) (queue=Default) COMPLETED after 0.7371
[Worker(host:k pid:4962)] Job ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper (id=9306) (queue=Default) RUNNING
[Worker(host:k pid:4962)] Job ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper (id=9306) (queue=Default) COMPLETED after 0.5055
[Worker(host:k pid:4962)] 2 jobs processed at 1.5728 j/s, 0 failed
@RedNoo My issue turned out to be that I had duplicate containers running in the background. Try running docker ps to check your containers. I also recommend checking out the dokku Slack channel. They helped me out a ton in the chat.
@RedNoo That log output is not duplicate jobs running. That is job 9305 starting and finishing and then 9306 starting and finishing.
I believe we've run into this same issue, or at least a similar one, when running multiple delayed job workers in one Kubernetes pod. Even though they are in separate containers, they all have the same hostname and PID due to the way Kubernetes pods work.
Our current workaround is a custom rake task that sets the worker name before starting it. Perhaps something similar could be added to the default set of tasks?
namespace :jobs do
  desc 'Start delayed_job worker with a custom name'
  task :work_with_name, [:name] => :environment_options do |_, args|
    args.with_defaults(name: SecureRandom.hex(5))
    # modified from line 149 in delayed/worker.rb
    worker_name = "#{("host:#{Socket.gethostname} pid:#{Process.pid}" rescue "pid:#{Process.pid}")} #{args[:name]}"[0...255]
    worker = Delayed::Worker.new(@worker_options)
    worker.name = worker_name
    worker.start
  end
end
Then we invoke this in our containers like so:
rake jobs:work_with_name[worker-0]
rake jobs:work_with_name[worker-1]
etc.
Hello,
I've faced this problem on 3.0.5 almost immediately after starting several workers instead of one.
A possible fix in my case was to change the self.reserve method on Delayed::Backend::ActiveRecord::Job in the delayed_job_active_record gem to:
def self.reserve(worker, max_run_time = Worker.max_run_time)
  # scope to filter to records that are "ready to run"
  ready_scope = self.ready_to_run(worker.name, max_run_time)
  # scope to filter to the single next eligible job
  next_scope = ready_scope.scoped
  next_scope = next_scope.scoped(:conditions => ['priority >= ?', Worker.min_priority]) if Worker.min_priority
  next_scope = next_scope.scoped(:conditions => ['priority <= ?', Worker.max_priority]) if Worker.max_priority
  next_scope = next_scope.scoped(:conditions => ["queue IN (?)", Worker.queues]) if Worker.queues.any?
  next_scope = next_scope.scoped.by_priority.limit(1)

  now = self.db_time_now
  job = next_scope.first
  return unless job

  job.with_lock do
    # Check still not locked by a parallel job:
    if ready_scope.where(id: job.id).count == 1
      job.locked_at = now
      job.locked_by = worker.name
      job.save!
    else
      return
    end
  end
  job
end
The changed part is marked with the comment "# Check still not locked by a parallel job:". This fixes the race condition, as far as I can tell.
Would be glad to hear what others think. Best,
Closing stale
Hi guys, I have an application using Delayed Job 3.0.5 (it's old), and we are having a strange bug.
Sometimes, and quite often, the same job is sent simultaneously to more than one dyno of our application (we are hosted on Heroku; it's the same queue, but it fires twice across the dynos).
Here's a screenshot that demonstrates the problem:
In the first column you can see a log inserted right before the DelayedJob "delay" method call (meaning the request was executed only once).
In the second column you can see the code from the "delayed" method firing 2 logs originating from one single delayed call (meaning the job itself was somehow executed twice).
This is the only point in our whole application where this piece of code is executed.
The problem is: we are only delaying the method one time, but somehow the method is sent to both "task_worker.1" and "task_worker.2", and at the very same time!
Of course we should update our DelayedJob, but I wanted to confirm whether this is a bug in 3.0.5. This application is running a lot of heavy tasks, and I didn't want to upgrade blindly without properly understanding the problem.
Thanks!