Working with @emmaoberstein on setting up a new queue, we discovered a condition in which the worker gets stuck in a spinloop, running the pickup query multiple times per second (effectively disobeying the `sleep_delay` config).

The likeliest reproduction is to enqueue a job that fails deserialization completely, or that otherwise prevents the usual job cleanup (e.g. it crashes the thread).

Here's where the `work_off` method hits the spinloop:
```ruby
def work_off(num = 100)
  success = Concurrent::AtomicFixnum.new(0)
  failure = Concurrent::AtomicFixnum.new(0)

  num.times do
    jobs = reserve_jobs
    break if jobs.empty? # jobs is not empty

    pool = Concurrent::FixedThreadPool.new(jobs.length)
    jobs.each do |job|
      pool.post do
        # Exception encountered when `payload_object` is first called,
        # e.g. the job fails to deserialize:
        # - success and failure are never incremented
        # - the job remains in the queue and is immediately returned by
        #   the next `reserve_jobs` without waiting
      end
    end
    pool.shutdown
    pool.wait_for_termination

    break if stop?
  end

  [success, failure].map(&:value)
end
```
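To make the failure mode concrete, here's a minimal self-contained simulation of the same loop shape. All names are hypothetical, and it uses plain Ruby threads instead of concurrent-ruby; the point is just that a job which raises before either counter is touched lets every iteration run back-to-back with no delay:

```ruby
# Hypothetical simulation of the spinloop (plain threads, no delayed_job).
def simulate_work_off(num, queue)
  success = 0
  failure = 0
  iterations = 0

  num.times do
    iterations += 1
    jobs = queue # stands in for reserve_jobs: the bad job is always there
    break if jobs.empty?

    threads = jobs.map do |job|
      Thread.new do
        job.call # raises, standing in for a deserialization failure
        success += 1
      rescue StandardError
        # The real failure path itself raises before `failure` is bumped,
        # so we likewise leave both counters untouched here.
      end
    end
    threads.each(&:join)
  end

  [success, failure, iterations]
end

bad_job = -> { raise "deserialization error" }
success, failure, iterations = simulate_work_off(5, [bad_job])
# All 5 iterations complete instantly with neither counter incremented.
```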
There are a few ways I could think of to fix this:
1. Add a new "inner loop" delay that enforces a minimum amount of time between each iteration of `num.times do`.
2. Bail out of the loop if neither `success` nor `failure` was incremented (i.e. no work got done).
3. Ensure that job cleanup happens in all cases (short of a complete loss of the DB connection), so that `reserve_jobs` won't immediately return the same job again (thanks to exponential backoff).
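For illustration, the second option could be as small as a progress check at the bottom of the loop. A hedged, self-contained sketch (illustrative names only, plain threads rather than concurrent-ruby):

```ruby
# Sketch of option 2: bail out of the inner loop when an iteration
# finishes without either counter moving (i.e. no work got done).
def work_off_with_bailout(num, queue)
  success = 0
  failure = 0
  iterations = 0

  num.times do
    iterations += 1
    jobs = queue # stands in for reserve_jobs
    break if jobs.empty?

    before = success + failure
    threads = jobs.map do |job|
      Thread.new do
        job.call
        success += 1
      rescue StandardError
        # failure path raised before the counter was bumped
      end
    end
    threads.each(&:join)

    break if success + failure == before # no progress: stop spinning
  end

  [success, failure, iterations]
end

_, _, iterations = work_off_with_bailout(100, [-> { raise "boom" }])
# iterations == 1: the worker gives up after one unproductive pass and
# falls back to the normal sleep_delay between pickup queries.
```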
All of these feel fairly reasonable, though I'd be inclined to explore the second and third. (Adding a new delay would require more tuning & testing and would not actually address the underlying failure mode for the job.) So, actually, I'd want to start with the third option, since it would likely also address the remaining issue in #23.
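A rough sketch of what the third option might look like: wrap the per-job work so that cleanup always runs, even when the exception fires before the normal failure handling. Everything here is illustrative rather than the delayed_job API, and the backoff formula is an assumption:

```ruby
# Sketch of option 3: guarantee that a failed job is rescheduled with
# backoff, so reserve_jobs won't immediately return it again.
Job = Struct.new(:attempts, :run_at) do
  def perform
    raise "deserialization error" # stands in for the failing payload
  end
end

def run_job_with_guaranteed_cleanup(job, now: Time.now)
  job.perform
  :success
rescue StandardError
  # Cleanup always runs: bump attempts and push run_at into the future
  # (exponential backoff — formula assumed for illustration), so the next
  # pickup query skips this job.
  job.attempts = (job.attempts || 0) + 1
  job.run_at = now + (job.attempts**4) + 5
  :failure
end

job = Job.new(0, Time.now)
result = run_job_with_guaranteed_cleanup(job)
# result == :failure, and job.run_at is now in the future.
```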