collectiveidea / delayed_job

Database based asynchronous priority queue system -- Extracted from Shopify
http://groups.google.com/group/delayed_job
MIT License

With 500K jobs in the delayed_jobs table, it gets really slow #581

Open amitsaxena opened 11 years ago

amitsaxena commented 11 years ago

With around 500K jobs in the delayed_jobs table, it gets really slow, to the point of being unusable. My slow query log is filled with these:

SET timestamp=1379681371;
UPDATE delayed_jobs SET locked_at = '2013-09-20 12:49:20', locked_by = 'delayed_job host:node1365 pid:20668' WHERE ((run_at <= '2013-09-20 12:49:20' AND (locked_at IS NULL OR locked_at < '2013-09-20 08:49:20') OR locked_by = 'delayed_job host:node1365 pid:20668') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1;

// Time: 130920 8:49:36
// Query_time: 4.683968  Lock_time: 0.000069  Rows_sent: 0  Rows_examined: 1
SET timestamp=1379681376;
DELETE FROM delayed_jobs WHERE delayed_jobs.id = 5233997;

// Time: 130920 8:50:55
// Query_time: 6.367763  Lock_time: 0.000082  Rows_sent: 3  Rows_examined: 595935
SET timestamp=1379681455;
SELECT COUNT(*) AS count_all, priority AS priority FROM delayed_jobs WHERE (run_at < '2013-09-20 12:50:48' and failed_at is NULL) GROUP BY priority;

I have the following indexes (I added some more to speed up queries that were showing up in the slow query log):

mysql> show index from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs |          0 | PRIMARY               |            1 | id          | A         |      628375 |     NULL | NULL   |      | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            1 | priority    | A         |          16 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            2 | run_at      | A         |       24168 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_lock     |            1 | locked_at   | A         |           9 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_lock     |            2 | locked_by   | A         |           9 |      100 | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | locked_by             |            1 | locked_by   | A         |           9 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | failed_at             |            1 | failed_at   | A         |          97 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | run_at                |            1 | run_at      | A         |       13660 |     NULL | NULL   | YES  | BTREE      |         |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

But still, it is extremely slow.

amitsaxena commented 11 years ago

In case someone is looking for a quick fix to clear the queue, here is how we did it: http://aawaara.com/post/61798959791/break-it-down

Those were time-sensitive jobs, and we needed to get them out of the queue.
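
For anyone who just needs to purge a backlog like that today, here is a rough sketch of a batched delete (not necessarily what the linked post describes; column names are from the standard delayed_jobs schema, the cutoff and batch size are arbitrary, and in_batches assumes Rails 5+):

# Delete pending, unlocked jobs whose run_at has long passed, in batches so a
# single huge DELETE does not hold locks on the whole table.
Delayed::Job.where(failed_at: nil, locked_at: nil)
            .where("run_at < ?", 1.day.ago)
            .in_batches(of: 10_000)
            .delete_all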

mattbornski commented 11 years ago

@amitsaxena have you considered using the priority column?

@banderous DelayedJob has many advantages over other worker solutions for Ruby. One of the things I like least about it is the interaction between running jobs and queueing jobs. Sounds like this would be an upgrade in this dimension. I'd vote for it.

amitsaxena commented 11 years ago

@mattbornski yes, I use the priority column - it's just that the workers issue so many database queries (polling the queue, updating timestamps, host, etc.) that MySQL becomes a bottleneck.

inspire22 commented 10 years ago

You might also try delayed_job_redis (or one of the other backends).

amitsaxena commented 10 years ago

@inspire22 we have moved to Sidekiq now.

xpepermint commented 9 years ago

Can we learn something from que?

albus522 commented 9 years ago

Que only works with PostgreSQL. So nope.

fguillen commented 8 years ago

I am suffering from this issue now.

I have tried adding more workers, but that just makes the problem worse.

albus522 commented 8 years ago

You can try:

Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql

Using that can affect performance both ways, but for some people it helps.
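
For reference, a minimal initializer sketch (the file path is an assumption; :optimized_sql is the gem's default strategy):

# config/initializers/delayed_job.rb  (assumed path)
# Fall back from the single UPDATE ... ORDER BY ... LIMIT 1 reservation query
# to the two-step SELECT-then-UPDATE strategy, which behaves better on some setups.
Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql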

Also is that 280k active jobs or is that mostly old failed jobs?

If all those are active what is your average job runtime? DJ is not the ideal candidate if you are running massive numbers of very fast jobs.

If they are old failed jobs, do you really still need them?

fguillen commented 8 years ago

Also is that 280k active jobs or is that mostly old failed jobs?

All active

If all those are active what is your average job runtime?

Less than a second, but long enough that we want to move them out of the request cycle.

DJ is not the ideal candidate if you are running massive numbers of very fast jobs.

Good to know

Thanks @albus522

fguillen commented 7 years ago

SO thread: http://stackoverflow.com/q/40360455/316700

DanielRussell commented 7 years ago

On MySQL at least, this appears to be due to the LIMIT 1 in several UPDATE queries. See this blog post for a bit more explanation.

I dropped my delayed_job table from ~220k to ~9k rows (lots of failures due to an application bug) and reduced MySQL CPU usage from 80% to 15%.
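
If, as in this case, most of the rows are old failures, a batched cleanup along these lines keeps the table small (a sketch; the 30-day retention window and batch size are arbitrary):

# Remove failed jobs older than 30 days in batches to avoid long-held locks.
Delayed::Job.where.not(failed_at: nil)
            .where("failed_at < ?", 30.days.ago)
            .in_batches(of: 10_000)
            .delete_all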

NikolayS commented 7 years ago

In Postgres, you can significantly speed it up just with 2 trivial steps:

1) Use SELECT ... FOR UPDATE SKIP LOCKED instead of just SELECT ... FOR UPDATE (the code is here https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb)

2) Add this index:

CREATE INDEX i_delayed_jobs_priority_partial
  ON delayed_jobs USING btree (priority, run_at, queue) WHERE failed_at IS NULL;
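
In Rails migration form, the same index would be roughly as follows (a sketch; the class name and migration version are placeholders, and the where: option for partial indexes requires the PostgreSQL adapter):

class AddPartialIndexToDelayedJobs < ActiveRecord::Migration[5.2]
  def change
    # Matches the worker pick-up query: filter out failed jobs,
    # then narrow by priority, run_at and queue.
    add_index :delayed_jobs, [:priority, :run_at, :queue],
              where: "failed_at IS NULL",
              name: "i_delayed_jobs_priority_partial"
  end
end
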
albus522 commented 7 years ago

SKIP LOCKED was introduced in Postgres 9.5, so that feature requires a relatively new Postgres install and would require us to make our code even more complicated with db version detection so things don't break for some users.

Indices are probably something we need to revisit, but results have not been consistent for any suggested index update so far. They have generally worked for a user's specific case but not all, and this index is very specific to Postgres.

NikolayS commented 7 years ago

@albus522 thank you for your response, but it was not a change request; it was advice for those who use DJ with Postgres (hopefully 9.6 or at least 9.5; otherwise, one more piece of advice: upgrade!).

Without those two improvements, you will have big performance issues if you need to process millions of jobs per day.

I considered creating a pull request but changed my mind when I saw how many PRs have been left unprocessed for years in this repo.

joshuapinter commented 5 years ago

Indexing is definitely the way to go. Here's what ours looks like:

add_index "delayed_jobs", ["deleted_at", "locked_at", "failed_at", "queue", "run_at", "locked_by"], name: "deleted_at_locked_at_failed_at_queue_run_at_locked_by", using: :btree

With about 200,000 rows, our queries went from about 500ms to 0.7ms, a 714x improvement.

amitsaxena commented 5 years ago

@joshuapinter that's a huge composite index, and I doubt most of the fields there are really adding any speed to your queries.

It's been quite a while since I moved away from delayed_job, but try having an index on just the most selective column (or maybe two if one is not enough) in the query, and that should give you similar performance gains. Once you are down from, say, a million rows to 100 (using a selective single-column index), the extra columns in the index don't really add much value; they just make your writes more costly.

Looking at the queries above (unless they have changed in newer versions), multiple single-column indexes probably make more sense here than one huge multi-column composite index.
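
Concretely, something like this (a sketch; which column is actually the most selective depends entirely on your data):

# One narrow index on a highly selective column often gets most of the win.
add_index :delayed_jobs, :failed_at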

joshuapinter commented 5 years ago

@amitsaxena You're right. I'm gonna run some quick tests there and see where I can stop and still achieve the same performance improvements.

joshuapinter commented 5 years ago

@amitsaxena Ha! You're right. I got the same performance improvement just by using a single deleted_at column in the index, which makes sense because that filters things down to single-digit records instead of 200,000+.

Thanks for the tip and reminder!

amitsaxena commented 5 years ago

Glad that I could be of help! :)

NikolayS commented 5 years ago

@albus522 isn't it time to reconsider it?

SKIP LOCKED was introduced in postgres 9.5, so that feature requires a relatively new postgres install

In less than a year, 9.5 will become the oldest community-supported version of Postgres, see https://www.postgresql.org/support/versioning/. Most of the versions currently supported by the PGDG already have the SKIP LOCKED feature, and MySQL has it as well since version 8.

Indices are probably something we need to revisit, but results have not been consistent for any suggested index update so far.

Have you checked the index I proposed in https://github.com/collectiveidea/delayed_job/issues/581#issuecomment-289920478? I believe it helps in the majority of cases, if not all of them.

gcv commented 5 years ago

For MySQL 5.6, adding an index on the failed_at column helped considerably.

Cervenka commented 3 years ago

Many SELECT queries (120 reads) on a table with under 2k entries brought everything to a crawl. Adding an index on delayed_jobs(priority, run_at) improved the situation. I also had an index on just the priority column, which I had to drop because MySQL 5.7 kept preferring it.
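
In migration form, that change might look roughly like this (a sketch; column names are from the comment above, the class name is a placeholder, and the index name matches the gem's conventional delayed_jobs_priority):

class FixDelayedJobsIndexes < ActiveRecord::Migration[5.2]
  def change
    # Composite index matching the worker pick-up query's filter and ordering.
    add_index :delayed_jobs, [:priority, :run_at], name: "delayed_jobs_priority"
    # Drop the single-column priority index that MySQL 5.7 kept choosing instead.
    remove_index :delayed_jobs, column: :priority
  end
end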

jairovm commented 2 years ago

I added an index on delayed_jobs(priority,run_at) and switched to the default_sql strategy, which is pretty stable now with about 1M records.

Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql

toao commented 2 years ago

I added an index on delayed_jobs(priority,run_at) and switched to the default_sql strategy, which is pretty stable now with about 1M records.

Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql

@jairovm With PostgreSQL or MySQL?

jairovm commented 2 years ago

It was MySQL, but I ended up building a hybrid between DelayedJob and Sidekiq: I wrote a small rake task that grabs jobs from the delayed_jobs table in batches of 1,000 and enqueues their ids into Redis, where Sidekiq workers pick them up and process them using Delayed::Worker. Something like this:

namespace :jobs do
  desc 'Enqueue Delayed::Job records'
  task enqueuer: :environment do
    Rails.logger.info "Starting job:enqueuer task"

    loop do
      # Only hand work over when at least one Sidekiq process is running.
      if Sidekiq::ProcessSet.new.size.positive?
        # Grab up to 1,000 due jobs and enqueue their ids for the job class below.
        Delayed::Job.order(:run_at).where(run_at: ..Time.current).limit(1_000).pluck(:id).each do |id|
          Rails.logger.info "Enqueueing DelayedJob::Job: #{id}"
          DelayedJob::EnqueuerJob.perform_later(job_id: id)
        end
      else
        Rails.logger.info "No workers available, skipping delayed_job jobs for now."
      end

      sleep 10
    end
  end
end

class DelayedJob::EnqueuerJob < ActiveJob::Base
  sidekiq_options retry: 5
  queue_as :critical

  def perform(job_id:, **)
    Delayed::Job.transaction do
      # Row-level lock keeps two Sidekiq workers from running the same job;
      # bail out if the job has already been processed and deleted.
      return unless (job = Delayed::Job.lock.find_by(id: job_id))

      Delayed::Worker.new.run(job)
    end
  end
end

With that, you can run rake jobs:enqueuer instead of the rake jobs:work task. We have about 1.5M records in our delayed_jobs table, and it's working pretty well πŸ‘ŒπŸ½ (still a beta version of this, though).

I hope it helps, @toao πŸ‘πŸ½

toao commented 2 years ago

@jairovm thanks for the feedback!

In the meantime I developed a similar batch solution, but within Delayed Job itself. It helped for a short interim period, but DJ has now been fully dropped and replaced with an AMQP-based approach. It works much more smoothly: no polling, and it scales nicely.

kad92 commented 1 year ago

In Postgres, you can significantly speed it up just with 2 trivial steps:

  1. Use SELECT ... FOR UPDATE SKIP LOCKED instead of just SELECT ... FOR UPDATE (the code is here https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb)
  2. Add this index:
CREATE INDEX i_delayed_jobs_priority_partial
  ON delayed_jobs USING btree (priority, run_at, queue) WHERE failed_at IS NULL;

I've added the indices to the delayed_job table and I'm already seeing significant improvement! Does anyone know how to do the modifications for the first step mentioned regarding using 'SELECT ... FOR UPDATE SKIP LOCKED'? I assume it should be done in the delayed_job initializer file in Rails.

camallen commented 3 months ago

I've added the indices to the delayed_job table and I'm already seeing significant improvement! Does anyone know how to do the modifications for the first step mentioned regarding using 'SELECT ... FOR UPDATE SKIP LOCKED'? I assume it should be done in the delayed_job initializer file in Rails.

I ran into a similar issue recently where large row counts in the delayed_jobs table caused high row lock contention on the db and job processing ground to a halt. The following monkey patch in an initializer file added the SKIP LOCKED PostgreSQL query functionality, and voilà, our db lock contention vanished πŸ’₯ (note we do have the index defined as well):

# typed: false
# frozen_string_literal: true

# https://github.com/collectiveidea/delayed_job/issues/581#issuecomment-289920478
# https://www.postgresql.org/docs/16/sql-select.html ('The Locking Clause' section)
module Delayed
  module Backend
    module ActiveRecord
      class Job < ::ActiveRecord::Base
        # Overrides the gem's Postgres reservation query so the locking subquery
        # skips rows already locked by other workers (FOR UPDATE SKIP LOCKED)
        # instead of waiting on them.
        def self.reserve_with_scope_using_optimized_postgres(ready_scope, worker, now)
          quoted_name = connection.quote_table_name(table_name)
          subquery    = ready_scope.limit(1).lock("FOR UPDATE SKIP LOCKED").select("id").to_sql
          sql         = "UPDATE #{quoted_name} SET locked_at = ?, locked_by = ? WHERE id IN (#{subquery}) RETURNING *"
          reserved    = find_by_sql([sql, now, worker.name])
          reserved[0]
        end
      end
    end
  end
end

Shout out to @NikolayS for providing the fix that worked for us.

Other Postgres-backed queue systems also implement SKIP LOCKED.

And a good read on why using this feature is a good idea: https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/

diclophis commented 1 month ago

@camallen / @NikolayS if we could get this SKIP LOCKED reservation strategy plugged into both the PostgreSQL and MySQL adapters as an alternate reservation strategy, it could make delayed_job a contender again in the Rails async job ecosystem!

camallen commented 1 month ago

Looks like there has been progress on supporting this technique in the backend gem https://github.com/collectiveidea/delayed_job_active_record/pull/215