Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day
MIT License

Possible unthrottled spinloop when jobs fail to deserialize #41

Closed smudge closed 3 months ago

smudge commented 3 months ago

Working with @emmaoberstein on setting up a new queue, we discovered that there is a condition when the worker will get stuck in a spinloop, running the pickup query multiple times per second (effectively disobeying the sleep_delay config).

The likeliest reproduction would be to enqueue a job that fails deserialization completely, or that otherwise prevents the usual job cleanup (e.g. it crashes the thread).
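Since the payload is stored as YAML, the deserialization failure is easy to reproduce outside of the gem: enqueue a job, then delete or rename its class. The class name below is made up, but any payload whose class can no longer be resolved fails the same way:

```ruby
require "yaml"

# "SomeRemovedJobClass" is a hypothetical name: it stands in for any job
# class that was removed or renamed after the payload was enqueued.
payload = "--- !ruby/object:SomeRemovedJobClass {}\n"

begin
  YAML.unsafe_load(payload) # Psych raises when it can't resolve the class
rescue ArgumentError, NameError => e
  puts "deserialization failed: #{e.class}"
end
```

(`YAML.unsafe_load` is the Psych 3.3+ spelling; older rubies resolve classes with plain `YAML.load`.)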

Here's where the work_off method hits the spinloop:

    def work_off(num = 100)
      success = Concurrent::AtomicFixnum.new(0)
      failure = Concurrent::AtomicFixnum.new(0)

      num.times do
        jobs = reserve_jobs
        break if jobs.empty? # jobs is never empty here, so we never break

        pool = Concurrent::FixedThreadPool.new(jobs.length)
        jobs.each do |job|
          pool.post do
            # Exception encountered when `payload_object` is first called, e.g. job fails to deserialize
            # - success and failure are never incremented
            # - job remains in queue and is immediately returned by the next `reserve_jobs` without waiting
          end
        end

        pool.shutdown
        pool.wait_for_termination

        break if stop?
      end

      [success, failure].map(&:value)
    end

There are a few ways I can think of to fix this:

  1. Add a new "inner loop" delay that sets a minimum amount of time in between each iteration of num.times do.
  2. Bail from the loop if neither success nor failure were incremented (i.e. no work got done).
  3. Ensure that job cleanup happens in all cases (short of a complete loss of the DB connection), so that exponential backoff kicks in and reserve_jobs won't immediately return the same job again.
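Option 2, for instance, amounts to snapshotting the counters before each batch and bailing if neither advanced. A runnable sketch, where SketchWorker and all of its plumbing are hypothetical stand-ins rather than the gem's actual classes:

```ruby
# Sketch of option 2: bail from the loop when no work got done.
class SketchWorker
  attr_reader :pickups

  def initialize(jobs)
    @jobs = jobs
    @pickups = 0
  end

  def work_off(num = 100)
    success = 0
    failure = 0

    num.times do
      jobs = reserve_jobs
      break if jobs.empty?

      before = success + failure
      jobs.each do |job|
        case safe_run(job)
        when :success then success += 1
        when :failure then failure += 1
        end # nil: the job crashed before it could be counted
      end

      # Neither counter moved, so no work got done -- stop spinning.
      break if success + failure == before
    end

    [success, failure]
  end

  private

  # Stub pickup: the same jobs come back on every call, as they would if
  # cleanup never ran. Counts queries so the loop's behavior is observable.
  def reserve_jobs
    @pickups += 1
    @jobs
  end

  def safe_run(job)
    job.call
  rescue StandardError
    nil # mirrors an exception escaping before success/failure increment
  end
end

# A job that raises on every run now triggers only one pickup query:
worker = SketchWorker.new([-> { raise "failed to deserialize" }])
worker.work_off(100)
worker.pickups # => 1
```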

All of these feel fairly reasonable, though I'd be inclined to explore the second and third. (Adding a new delay would require more tuning and testing, and would not actually address the underlying failure mode for the job.) So I'd want to start with the third option, since it would likely also address the remaining issue in #23.
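The shape of option 3 is to run the job inside an ensure so that even an exception raised before the normal error handling still pushes the record's run_at into the future. Everything below is illustrative, not the gem's actual code: FakeJob stands in for a job record, and the `(attempts**4) + 5` delay borrows delayed_job's default backoff formula.

```ruby
# FakeJob is a hypothetical stand-in for a Delayed::Job record.
FakeJob = Struct.new(:attempts, :run_at, keyword_init: true)

def run_with_guaranteed_cleanup(job, now: Time.now)
  completed = false
  begin
    yield # may raise before success/failure are ever counted
    completed = true
  ensure
    unless completed
      # Back the job off so the next pickup query (which only selects jobs
      # with run_at <= now) won't return it again immediately.
      job.attempts += 1
      job.run_at = now + (job.attempts**4) + 5
    end
  end
end

job = FakeJob.new(attempts: 0, run_at: nil)
begin
  run_with_guaranteed_cleanup(job) { raise "won't deserialize" }
rescue RuntimeError
  # the error still propagates, but the job has been rescheduled
end
```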

smudge commented 3 months ago

Should be resolved by #42