bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/
MIT License

Avoid querying through `serialized_params` #876

Open ollym opened 1 year ago

ollym commented 1 year ago

As we've scaled up with good_job we've started to feel significant pain. In particular, when a job's arguments are large, the performance of queries that reach through `serialized_params` can slow down dramatically.

I think you should add `job_class` and `executions` as dedicated columns and refactor any code that reads those values to stop going through `serialized_params`.

Is there any reason why you don't?
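To illustrate the cost being described (a plain-Ruby sketch only, not a Postgres benchmark; the `rows` data and field names are made up): filtering on a value buried inside a serialized blob means parsing the blob for every row, whereas filtering on a dedicated column is a single, indexable comparison.

```ruby
require "json"

# Fake rows: the same value stored inside a serialized blob and as a column.
rows = Array.new(3) do |i|
  {
    job_class: "WebhookJob",
    serialized_params: JSON.generate("job_class" => "WebhookJob", "arguments" => [i])
  }
end

# Filtering through the blob re-parses every row's payload...
via_blob = rows.count { |r| JSON.parse(r[:serialized_params])["job_class"] == "WebhookJob" }

# ...while a dedicated column is a plain comparison.
via_column = rows.count { |r| r[:job_class] == "WebhookJob" }

via_blob   # => 3
via_column # => 3
```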

ollym commented 1 year ago

A quick monkey patch for anyone interested and having the same issues:

# db/migrate/....rb
add_column :good_jobs, :job_class, :string
add_column :good_jobs, :executions, :integer, default: 0
# config/initializers/good_job.rb
Rails.application.reloader.to_prepare do
  Rails.application.configure do
    # ... good job configuration here
  end

  module GoodJob::JobExtensions
    extend ActiveSupport::Concern

    prepended do
      scope :job_class, ->(job_class) { where(job_class: job_class) }

      # First execution will run in the future
      scope :scheduled, -> { where(finished_at: nil).where('COALESCE(scheduled_at, created_at) > ?', DateTime.current).where(arel_table[:executions].lt(2)) }

      # Execution errored, will run in the future
      scope :retried, -> { where(finished_at: nil).where('COALESCE(scheduled_at, created_at) > ?', DateTime.current).where(arel_table[:executions].gt(1)) }
    end

    def job_class
      read_attribute(:job_class)
    end

    def serialized_params=(serialized_params)
      self.job_class = serialized_params['job_class']
      self.executions = serialized_params.fetch('executions', 0)
      super
    end
  end
  GoodJob::Job.prepend(GoodJob::JobExtensions)

  module GoodJob::ExecutionExtensions
    extend ActiveSupport::Concern

    prepended do
      scope :job_class, ->(job_class) { where(job_class: job_class) }
    end

    def serialized_params=(serialized_params)
      self.job_class = serialized_params['job_class']
      self.executions = serialized_params.fetch('executions', 0)
      super
    end
  end
  GoodJob::Execution.prepend(GoodJob::ExecutionExtensions)

  module GoodJob::BaseFilterExtensions
    def job_classes
       filtered_query(params.slice(:queue_name)).unscope(:select)
                                                .group(:job_class).count
                                                .sort_by { |name, _count| name.to_s }
                                                .to_h
     end
  end
  GoodJob::BaseFilter.prepend(GoodJob::BaseFilterExtensions)
end
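For anyone skimming the patch: the core trick is the prepended `serialized_params=` writer, which copies two values out of the payload into their own attributes before calling `super`. A minimal stand-in in plain Ruby (hypothetical classes, no Rails or GoodJob required) shows the mechanics. Note that rows written before the migration would still need a one-time backfill, since the writer only fires on new writes.

```ruby
# Stand-ins for GoodJob::Job and the extension module above (names are
# illustrative; the real classes are Active Record models).
class FakeJob
  attr_accessor :job_class, :executions
  attr_reader :serialized_params

  def serialized_params=(params)
    @serialized_params = params
  end
end

module FakeJobExtensions
  # Denormalize on write, then hand off to the original writer via super.
  def serialized_params=(params)
    self.job_class  = params["job_class"]
    self.executions = params.fetch("executions", 0)
    super
  end
end

FakeJob.prepend(FakeJobExtensions)

job = FakeJob.new
job.serialized_params = { "job_class" => "WebhookJob", "arguments" => [1, 2] }
job.job_class                       # => "WebhookJob"
job.executions                      # => 0
job.serialized_params["job_class"]  # => "WebhookJob"
```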
bensheldon commented 1 year ago

Ooh, that's interesting!

Is there any reason why you don't?

The reason upfront: the Dashboard is a work-in-progress and very YAGNI. So I appreciate you sharing: yes, you need it.

That's interesting that grabbing values out of the JSONB column is that much slower than individual columns; I wouldn't have expected that. My intuition has been to only pull values out of `serialized_params` when they need to be indexed (although I could index them within the jsonb, but that seems too extra), so it's good to know I may need to adjust that intuition.

This work is definitely doable and should be done. It'll be a little tricky to make sure it works without requiring the database migration to be run until the next major release, but that's a common problem I've been handling when these things come up. Here's an example:

https://github.com/bensheldon/good_job/blob/9264e5b842405ae02ca1f0fb8473a3e988683cd2/lib/good_job/job.rb#L81-L88

It might take me a minute to get to it. Is this something you'd be able to submit a PR for?
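That guard pattern can be sketched in plain Ruby (the method name and column lists here are hypothetical; the real version would check the model's `column_names`): feature-detect the new column and fall back to the JSONB expression when it's missing, so installs that haven't run the migration yet keep working.

```ruby
# Pick the SQL expression for filtering by job class, depending on whether
# the denormalized column exists yet (illustrative; not GoodJob's API).
def job_class_sql(column_names)
  if column_names.include?("job_class")
    "job_class"                        # new path: plain, indexable column
  else
    "serialized_params->>'job_class'"  # legacy path: extract from jsonb
  end
end

job_class_sql(%w[id serialized_params])            # => "serialized_params->>'job_class'"
job_class_sql(%w[id serialized_params job_class])  # => "job_class"
```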

ollym commented 1 year ago

Yeah, we migrated our webhook backend to the good_job queue, so the arguments can be large (the full request body), and that combined with 50k scheduled jobs meant things were grinding to a halt. We tried adding a btree index on `serialized_params->>'job_class'`, but weirdly it had no effect.

It makes sense, because I assume that even to get a single value out of the JSON object, it still has to read the full body. We must have been hitting a memory limit or something.
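That intuition can be illustrated in plain Ruby (an analogy only: Postgres stores `jsonb` pre-parsed, but a large value is TOASTed, so roughly speaking the whole stored payload still has to be fetched and decompressed before one key can be extracted):

```ruby
require "json"

# A ~100 KB payload, like a webhook job carrying a full request body.
payload = JSON.generate("job_class" => "WebhookJob", "body" => "x" * 100_000)

# Reading a single key still means materializing the entire document first.
JSON.parse(payload)["job_class"]  # => "WebhookJob"
payload.bytesize > 100_000        # => true
```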

I'll see what we can do about a PR

ollym commented 1 year ago

For others with the same issue: thanks to https://github.com/bensheldon/good_job/pull/894 (released in good_job v3.15), the patch is significantly simpler. The following works now:

# db/migrate/....rb
add_column :good_jobs, :job_class, :string
add_column :good_jobs, :executions, :integer, default: 0
# config/initializers/good_job.rb
Rails.application.reloader.to_prepare do
  module GoodJob::BaseExecutionExtensions
    extend ActiveSupport::Concern

    class_methods do
      def params_job_class
        arel_table[:job_class]
      end

      def params_execution_count
        arel_table[:executions]
      end
    end

    def job_class
      read_attribute(:job_class)
    end

    def serialized_params=(serialized_params)
      self.job_class = serialized_params['job_class']
      self.executions = serialized_params.fetch('executions', 0)
      super
    end
  end

  GoodJob::BaseExecution.prepend(GoodJob::BaseExecutionExtensions)
end