collectiveidea / delayed_job

Database based asynchronous priority queue system -- Extracted from Shopify
http://groups.google.com/group/delayed_job
MIT License
4.82k stars 954 forks source link

2 workers picking up the same job #658

Open carstengehling opened 10 years ago

carstengehling commented 10 years ago

_Updated: tested with delayedjob 4.0.1 - problem still persists

With the gems:

rails 3.2.13
delayed_job 4.0.0
delayed_job_active_record 4.0.0
pg 0.17.1

and PostgreSQL v. 9.1.11

Running 2 workers, they pick up and process the same job:

2014-05-12T15:04:00+0200: [Worker(delayed_job.1 host:waldseemuller pid:8470)] Job Class#perform (id=56439) RUNNING
2014-05-12T15:04:00+0200: [Worker(delayed_job.0 host:waldseemuller pid:8464)] Job Class#perform (id=56439) RUNNING
2014-05-12T15:04:00+0200: [Worker(delayed_job.0 host:waldseemuller pid:8464)] Job Class#perform (id=56439) COMPLETED after 0.1915
2014-05-12T15:04:00+0200: [Worker(delayed_job.1 host:waldseemuller pid:8470)] Job Class#perform (id=56439) COMPLETED after 0.2749

Is there a way around this?

albus522 commented 10 years ago

First try upgrading. If the problem persists, reopen

carstengehling commented 10 years ago

Sorry - I've tried updating to 4.0.1 and the problem still persist.

2014-05-12T16:57:24+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:24814)] Job Class#perform (id=19361) RUNNING
2014-05-12T16:57:24+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:24807)] Job Class#perform (id=19361) RUNNING
2014-05-12T16:57:27+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:24807)] Job Class#perform (id=19361) COMPLETED after 3.6808
2014-05-12T16:57:28+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:24814)] Job Class#perform (id=19361) COMPLETED after 3.7396
carstengehling commented 10 years ago

How do I reopen?

albus522 commented 10 years ago

What DJ configuration options do you have set? What is the exact command you are using to run DJ?

carstengehling commented 10 years ago

From config/initializers/delayed_job.rb:

Delayed::Worker.max_run_time =  24.hours
Delayed::Worker.max_attempts = 1
Delayed::Worker.destroy_failed_jobs = false

The workers are started with:

script/delayed_job start -n 2
carstengehling commented 10 years ago

I just tried with the latest commits on master branch - same result:

2014-05-12T18:44:24+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Starting job worker
2014-05-12T18:44:24+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Starting job worker
2014-05-12T18:44:26+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19788) RUNNING
2014-05-12T18:44:26+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19788) RUNNING
2014-05-12T18:44:28+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19788) COMPLETED after 1.9184
2014-05-12T18:44:28+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19788) COMPLETED after 1.9570

2014-05-12T18:44:30+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19795) RUNNING
2014-05-12T18:44:30+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19795) RUNNING
2014-05-12T18:44:31+0200: [Worker(delayed_job.0 host:carsten-ThinkPad-X1-Carbon pid:27915)] Job Class#perform (id=19795) COMPLETED after 0.6636
2014-05-12T18:44:31+0200: [Worker(delayed_job.1 host:carsten-ThinkPad-X1-Carbon pid:27922)] Job Class#perform (id=19795) COMPLETED after 0.7026

The job is defined like this:

class AnalysisJob

  def self.perform(analysis_id)
    analysis = Analysis.find(analysis_id)
    if analysis.transport_form == 'private'
      PrivateTransportAnalysisProcess.new(analysis).run
    else
      PublicTransportAnalysisProcess.new(analysis).run
    end
  end

end

The jobs are created like this:

AnalysisJob.delay.perform(id)
betamatt commented 10 years ago

@carstengehling can you create a small sample app that reproduces the issue?

kubbing commented 10 years ago

similar thing happening to me :/

jnimety commented 9 years ago

I have a lead...

Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 12345

Job runs and fails during row delete, another worker eventually picks it up and runs the job again. Noticed this about 10 times while our queue was processing approximately 4000 jobs. We have 6 workers.

jnimety commented 9 years ago

delayed_job (4.0.6) btw

jhirbour commented 9 years ago

I've added the following index to match the WHERE clause that DJ is using for the row locking and unlocking. This should help with the speed of locking and unlocking rows.

add_index :delayed_jobs, [:run_at, :locked_at, :locked_by, :failed_at], :name => 'index_delayed_jobs_for_better_row_locking'
defeated commented 9 years ago

@jnimety

Running multiple workers, we're running into this Deadlock error frequently as well when there are a ton of jobs in the queue.

mysql2 delayed_job 4.0.6 daemons 1.2.2

Worker(host:app01 pid:4191)] Job Searchable::SearchDestroyJob (id=15992398) FAILED (0 prior attempts) with ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 15992398
rake aborted!

ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `last_error` = 'Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 15992398\n

which is causing the workers to crash and require a restart. (we're using the --monitor option so this happens automatically after some amount of time)

Looks related to https://github.com/collectiveidea/delayed_job_active_record/issues/63

sventantau commented 8 years ago

I observed the same symptoms.

Take a look at: https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb#L90

I am using the activerecord-postgis-adapter. So for me: ActiveRecord::Base.connection.adapter_name => "PostGIS"

I monkey patched this and have not seen any 2 workers picking up the same job since then.

hth

amalagaura commented 8 years ago

@sventantau We just ran into this same thing. Thank you for posting this, I would not have found this so soon. I will put a pull request for this.