Hey Eileen!
Currently, the "shutdown mode" gives an 8 sec. grace period (Heroku forcefully kills after 10 sec.) to attempt to finish the jobs in the queue; if they're not complete after those 8 sec., it kills the workers aggressively, so the behavior you're seeing sounds consistent with that.
I added SuckerPunch.shutdown_timeout as the time to wait for jobs to finish before forcefully shutting down. I'd suggest setting it to a really high value, but I'm testing locally and it doesn't seem to be doing what I expect.
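For reference, it's just a module-level setting; a minimal sketch of bumping it (the 60 here is an arbitrary value for illustration):

# e.g. in config/initializers/sucker_punch.rb, or anywhere before jobs are enqueued
SuckerPunch.shutdown_timeout = 60 # seconds to wait for in-flight jobs before a hard kill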
A bit of history...
Sucker Punch was my answer to not spinning up another dyno on Heroku to use something like Sidekiq. So in short, it was a money-saving tactic. I think most users tend to use it for hobby apps and the like. Basecamp projects are far from that. So the existing behavior probably works really well for most.
When I was thinking about Sucker Punch v2, I originally put in the ability to configure the shutdown mode. One of the options was "wait until the queues are exhausted and then shut down", which sounds like what you're going for. The resulting code was pretty complex, and it worried me that the "other" modes wouldn't be used 99% of the time. But here we are... ;)
As a related side note, I keep some scripts around for testing this...
#!/usr/bin/env ruby
require 'bundler'
Bundler.require(:default)
require_relative '../lib/sucker_punch'

class CountdownJob
  include SuckerPunch::Job

  def perform(i)
    sleep 0.1
    print "Executing job #{i}\n"
  end
end

puts "Enqueuing 100 jobs..."
100.times { |i| CountdownJob.perform_async(i) }
sleep 0.5
You'll never see the counter get to 100 b/c of the forceful shutdown after 8 sec. If the shutdown timeout were operating properly, this wouldn't be a problem.
So where do we go from here?
I'll work to get the shutdown_timeout fixed up.
Alternatively, without any modification of Sucker Punch, one option would be to introduce some kind of countdown latch, similar to Golang's WaitGroup: you'd increment it for each job enqueued, and then defer to the latch to decide when to stop the script. The downside is that you'd have to pass the latch into your job code and decrement it upon successful completion of each job. Concurrent Ruby has a built-in countdown latch. I've used it extensively in testing because it saves you from putting an arbitrary sleep at the end of the script, hoping you calculated the time right while trying not to overestimate, which is gross.
Here's a naive example:
class CreateUserJob
  include SuckerPunch::Job

  def perform(n, latch = nil)
    User.create!(name: "name #{n}")
    latch.count_down if latch
  end
end

number_of_users = 100
latch = Concurrent::CountDownLatch.new(number_of_users)

number_of_users.times do |n|
  CreateUserJob.perform_async(n, latch)
end

latch.wait(1000) # waits at most 1000 sec., but returns as soon as all jobs have counted down the latch
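For what it's worth, Concurrent::CountDownLatch#wait(timeout) returns true if the count reaches zero before the timeout and false otherwise, so you could also branch on its return value to report jobs that never finished.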
Issue to fix up shutdown_timeout: https://github.com/brandonhilkert/sucker_punch/issues/174
Thanks for the quick response @brandonhilkert! :)
We only use SuckerPunch in dev, so if this is how it should behave for most users then I totally get it. Adding complicated code just so we can use it in dev doesn't make sense. But I wanted to make sure this wasn't a bug.
Adding a countdown latch to the script doesn't work for us because we don't use SuckerPunch in production. I was testing my production script locally for syntax errors and the like, but noticed it was stopping after performing 2 jobs when it should have performed 7.
Thanks for opening the countdown latch issue!
Possibly a problem with thread pool auto termination. I'll look at it this weekend. More detail here.
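For anyone following along: SuckerPunch's queues run on concurrent-ruby thread pools, and if a pool is killed at process exit instead of drained, queued jobs are dropped. A rough sketch of the drain pattern in plain concurrent-ruby (illustrative only, not SuckerPunch's internals):

require 'concurrent'

pool = Concurrent::ThreadPoolExecutor.new(max_threads: 2)
10.times { |i| pool.post { sleep 0.1; puts "task #{i}" } }

pool.shutdown                 # stop accepting new work
pool.wait_for_termination(10) # block until queued tasks finish, or give up after 10 sec.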
@brandonhilkert, I understand how the latch works, but what I'm missing is how this would help with the shutdown_timeout. Is your naive example assuming that shutdown_timeout is set to a very high value, expecting the latch to exit beforehand?
@aried3r Yup. This has since been fixed.
We use SuckerPunch locally in dev for running jobs. This works great for jobs created by the application and for jobs run in the console. But I noticed an issue.
I needed to test a script I wrote to perform an action on a bunch of records. I was testing it with Sucker Punch and noticed that when the script finishes queueing up jobs and exits, the queue shuts down and never finishes the queued jobs.
I made a test app to demonstrate: https://github.com/eileencodes/suckerpunch_test. The number of Users the test creates varies based on how long the script takes to exit, but it won't create all 100 users; it exits around User 28-35 (depending on how fast your db is).
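The core of the test app boils down to something like this (a rough sketch; the actual repo may differ slightly):

class CreateUserJob
  include SuckerPunch::Job

  def perform(n)
    User.create!(name: "User #{n}")
  end
end

100.times { |n| CreateUserJob.perform_async(n) }
# The script exits here, and the queue is shut down before all 100 jobs run.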
I didn't think this was the same behavior I had seen in the past, and I found a point in SuckerPunch's history where all the jobs finished. That behavior was introduced in this PR from @jdantonio: https://github.com/brandonhilkert/sucker_punch/pull/148
Looking at that, it appears that SuckerPunch should be doing a graceful shutdown and finishing the jobs, but I'm not sure that's the case. Let me know if exiting without finishing jobs is expected behavior. I've looked at the code and tried a few changes, but haven't been able to figure out how to keep the queue from shutting down until the jobs finish.
Let me know if you need more information. Thanks! :smile: