brandonhilkert / sucker_punch

Sucker Punch is a Ruby asynchronous processing library using concurrent-ruby, heavily influenced by Sidekiq and girl_friday.
MIT License

Allow shutdown to be overridden with Lambda #146

Closed brandonhilkert closed 8 years ago

brandonhilkert commented 8 years ago

@jdantonio - Can you suggest a way to allow currently processing jobs to complete, but not have the workers pick up new jobs?

I think I might be a little unsure of what shutdown will do vs. kill. Using this script, both seem to exit immediately.

jdantonio commented 8 years ago

You'll have to manage in-queue jobs within Sucker Punch.

The difference between #kill and #shutdown is how they handle thread exit. Neither method allows jobs in the queue to be processed. Decisions about the queue are deliberately left to the application. There are too many possible states and too many possible options for them to be effectively handled by the thread pool itself.

The difference between the two methods is that #kill attempts a language-level kill of every thread in the pool, regardless of its operating state. The risk of #kill is that threads may be killed while the system is in an inconsistent or intermediate state; if a thread is connected to external resources, the connections may be immediately severed. The advantage of #kill is that the pool should stop very quickly.

The #shutdown method allows each thread to finish its current job but instructs it not to grab any more jobs from the queue. The risk of #shutdown is that one or more long-running jobs may "hang" the shutdown process for an indeterminate amount of time; there is no way to know when those jobs will complete. So #shutdown should never leave the system in an inconsistent state, but it may take a long (and nondeterministic) time.

The most thorough practice for shutting down a thread pool is to combine the two methods:

To manage the remaining enqueued jobs I recommend setting a global on/off flag and have your job posting function check this flag. Combine this with a special "stop" job to indicate that the thread pools are done.

jdantonio commented 8 years ago

For reference, here are a couple of places within concurrent-ruby where we use flags and "special" jobs to manage the underlying thread pool:

jdantonio commented 8 years ago

My last comment I promise! :-)

The reason these issues didn't come up with Celluloid is that it's a very opinionated framework. The implementers had to deal with these issues, too. But to make life simpler for their users they decided what to do in these cases and implemented it that way. This is great so long as everything works properly, but when it doesn't you are at the mercy of Celluloid (#132, #135, etc.).

Conversely, concurrent-ruby is deliberately unopinionated. We ask our users to make these decisions for themselves and provide the tools necessary to implement an application the way they want. Plus, ThreadPoolExecutor is a low-level abstraction (much lower level than Concurrent::Future, Concurrent::Actor, Concurrent::Agent, etc.). So there is a trade-off.

But it's important to understand that these issues exist when shutting down any concurrent application written in any language. The only real difference is the toolset and the options it gives you.

brandonhilkert commented 8 years ago

@jdantonio Makes sense. Assuming we want to allow the following:

# Currently running jobs are allowed to complete, but queued jobs are discarded
SuckerPunch.shutdown_mode = :soft # DEFAULT

# All jobs are terminated immediately (both currently running and queued)
SuckerPunch.shutdown_mode = :hard

# Shutdown is blocked until both running and queued jobs complete
SuckerPunch.shutdown_mode = :none

If I understand correctly, this maps to like:

:soft => pool.shutdown
:hard => pool.kill
:none => pool.wait_for_termination

I hear everything you're saying about giving the pools opportunities to shut down and sending multiple signals to them. Just want to provide the basics for now. Do you see any issue with the above?

When I ran the script in the none mode (wait_for_termination), when it exhausted the queue, the following was triggered:

Executing job 499
Executing job 498
/Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/synchronization/mri_lockable_object.rb:43:in `sleep': No live threads left. Deadlock? (fatal)
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/synchronization/mri_lockable_object.rb:43:in `wait'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/synchronization/mri_lockable_object.rb:43:in `ns_wait'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/synchronization/abstract_lockable_object.rb:43:in `ns_wait_until'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/atomic/event.rb:67:in `block in wait'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/synchronization/mri_lockable_object.rb:38:in `block in synchronize'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/synchronization/mri_lockable_object.rb:38:in `synchronize'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/synchronization/mri_lockable_object.rb:38:in `synchronize'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/atomic/event.rb:64:in `wait'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/executor/ruby_executor_service.rb:49:in `wait_for_termination'
    from /Users/bhilkert/Dropbox/code/sucker_punch/lib/sucker_punch/shutdown_mode.rb:21:in `shutdown'
    from /Users/bhilkert/Dropbox/code/sucker_punch/lib/sucker_punch/job.rb:67:in `shutdown'
    from /Users/bhilkert/Dropbox/code/sucker_punch/lib/sucker_punch/job.rb:26:in `block in included'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/utility/at_exit.rb:72:in `call'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/utility/at_exit.rb:72:in `block in run'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/utility/at_exit.rb:70:in `each'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/utility/at_exit.rb:70:in `run'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/utility/at_exit.rb:88:in `runner'
    from /Users/bhilkert/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/concurrent-ruby-1.0.0/lib/concurrent/utility/at_exit.rb:49:in `block (2 levels) in install'

Thoughts?

jdantonio commented 8 years ago

A call to #wait_for_termination must occur after a call to either #kill or #shutdown. PR #147 is a spike of what I was trying to describe above. It needs some work to be production-ready, but I think (hope) it demonstrates the ideas. It can be run against the bin/shutdown script as demonstration.

brandonhilkert commented 8 years ago

@jdantonio Thanks so much for the example. Seeing code firsthand always helps me wrap my head around it.

I want to take a step back from what I was doing with the modes. Part of why I think Sucker Punch is valuable to the community, especially for those with little or no multi-threading experience, is that it removes these hairy details. Prior to seeing the Celluloid output about not shutting down properly, there wasn't a single question about how threads exited or about the effect on jobs currently in process.

With that in mind, I'd like to provide a sane default, while also allowing those who are interested to provide a custom shutdown routine. That'll keep the details away from those just looking to do simple logging or send emails (the 2 most common use cases). Those with a more robust use case will have the opportunity to configure any one of the modes previously discussed.

Having said all that, I'd like the default to be something like what I previously described as the soft mode, but then exiting after 10 sec. 10 sec. is the time frame Heroku uses before forcefully shutting down the application, and I have a feeling Heroku is the most common host of Sucker Punch, given its use in saving money on an additional worker dyno. I spiked out something like what you described; however, when I run the bin/shutdown script, new jobs continue to be picked up even after the shutdown method is sent to the pool. I didn't find that to be the case before when using AtExit. Is there something different about this implementation that would cause that?

Also, with the 10 sec. limit in mind, I don't think I can iterate through the queues and call wait_for_termination on each, given that the last queue in the map would be cut off before it received the call if Heroku killed the process (similar to your comment here).

Also, out of curiosity, what's the difference in using at_exit vs. Concurrent::AtExit.add?

jdantonio commented 8 years ago

:+1: to setting reasonable defaults. That's what I've always tried to do with c-r. That's why we have at_exit handlers, global thread pools, an internal monotonic clock, etc. Most users can simply create futures, promises, actors, etc. and not worry about the details. Advanced users have much greater control.

Concurrent::AtExit is an internal implementation detail that I probably didn't need to mention. Under the hood it just uses at_exit. The problems with at_exit are 1) the order in which handlers are created is important, and 2) once you register a handler you can't change it. Internally we use AtExit to create a registry of procs that we then run from within a single at_exit handler, which gives us greater control. In your case you probably only need the one handler, and the order it runs in relative to handlers registered by other gems is probably irrelevant. So a single at_exit call at the end of your initialization routine is probably sufficient--no need to delegate it back to c-r.
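A sketch of that registry pattern (illustrative names only, not concurrent-ruby's actual code):

```ruby
# Collect handler procs in a registry, then run them all from a single
# at_exit block so the application controls their relative order.
shutdown_handlers = []
shutdown_handlers << -> { "shut down thread pools" }
shutdown_handlers << -> { "flush buffered logs" }

at_exit do
  # Run in registration order, independent of whatever at_exit
  # blocks other gems may have installed.
  shutdown_handlers.each(&:call)
end
```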

I'll try to take a look at your code as soon as possible. I'm heading out of town tomorrow evening for a conference so this will likely be a very busy week. My apologies in advance if I don't get to this right away. I'll do my best to help you get 2.0 out while Rails 5 is still in beta. It would be awesome to see Rails 5 able to use the updated Sucker Punch!

jdantonio commented 8 years ago

PS: I created the new built-in `:async` job processor and queue adapter (https://github.com/rails/rails/pull/21257) in Rails 5, so I'm very familiar with how queue adapters work. Once 2.0 is ready I'm happy to help you update the sucker_punch adapter.

brandonhilkert commented 8 years ago

Awesome! Reading through the async adapter was one of the reasons I moved forward converting sucker punch to c-r :)


brandonhilkert commented 8 years ago

@jdantonio Going to merge this into the concurrent branch for now. Would love to continue the conversation on that PR