Ruby process quits before Sucker Punch finishes

bdewater commented 7 years ago

Hi! I've run into an interesting issue while upgrading a Rails app from Sucker Punch 1.x to 2.0. The tl;dr is that during a staging test Airbrake's (4.3.8) test rake task did not send a notification. After investigation it seems the Ruby process quits while the job isn't finished.

The output with SP 1.2:

$ bundle exec rake airbrake:test
Started GET "/verify" for  at 2017-07-27 09:46:35 +0000
Raising 'AirbrakeTestingException' to simulate application failure.

Testing airbrake via "rake airbrake:test". If you can see this, it works.
** [Airbrake] Success: Net::HTTPOK
** [Airbrake] Environment Info: [Ruby: 2.3.1] [Rails: 4.2.7.1] [Env: staging]
** [Airbrake] Response from Airbrake: 
UUID: ✂️
URL: ✂️

With 2.0:

Started GET "/verify" for 127.0.0.1 at 2017-07-27 08:55:22 +0000
Raising 'AirbrakeTestingException' to simulate application failure.

Testing airbrake via "rake airbrake:test". If you can see this, it works.

First I copied over AirBrake's SendJob with a different name to the app directory so I could stick a binding.pry in there. With a step and a few next calls it ended up sender.rb on line 54 just before doing the HTTP post where it mysteriously exited.

I ran it again with byebug --no-quit for post-mortem debugging and all of a sudden the POST went through just fine:

$ byebug --no-quit bundle exec rake airbrake:test

[4, 13] in /Users/bartdewater/.gem/ruby/2.3.4/bin/bundle
    4: #
    5: # The application 'bundler' is installed as part of a gem, and
    6: # this file is here to facilitate running it.
    7: #
    8:
=>  9: require 'rubygems'
   10:
   11: version = ">= 0.a"
   12:
   13: if ARGV.first
(byebug) continue
✂️

Started GET "/verify" for 127.0.0.1 at 2017-07-27 15:44:50 -0400
Raising 'AirbrakeTestingException' to simulate application failure.

Testing airbrake via "rake airbrake:test". If you can see this, it works.
exit
["/Users/bartdewater/.gem/ruby/2.3.4/gems/bundler-1.15.3/lib/bundler/friendly_errors.rb:124:in `exit'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/bundler-1.15.3/lib/bundler/friendly_errors.rb:124:in `rescue in with_friendly_errors'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/bundler-1.15.3/lib/bundler/friendly_errors.rb:121:in `with_friendly_errors'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/bundler-1.15.3/exe/bundle:22:in `<top (required)>'", "/Users/bartdewater/.gem/ruby/2.3.4/bin/bundle:22:in `load'", "/Users/bartdewater/.gem/ruby/2.3.4/bin/bundle:22:in `<top (required)>'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/byebug-9.0.6/lib/byebug/runner.rb:174:in `debug_load'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/byebug-9.0.6/lib/byebug/runner.rb:174:in `debug_program'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/byebug-9.0.6/lib/byebug/runner.rb:93:in `block in run'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/byebug-9.0.6/lib/byebug/runner.rb:92:in `loop'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/byebug-9.0.6/lib/byebug/runner.rb:92:in `run'", "/Users/bartdewater/.gem/ruby/2.3.4/gems/byebug-9.0.6/bin/byebug:7:in `<top (required)>'", "/Users/bartdewater/.gem/ruby/2.3.4/bin/byebug:22:in `load'", "/Users/bartdewater/.gem/ruby/2.3.4/bin/byebug:22:in `<main>'"]
(byebug:ctrl) ** [Airbrake] Success: Net::HTTPOK
** [Airbrake] Environment Info: [Ruby: 2.3.4] [Rails: 4.2.7.1] [Env: development]
** [Airbrake] Response from Airbrake:
UUID: ✂️
URL:  ✂️
quit
Really quit? (y/n) y

The (byebug:ctrl) prompt prefix indicates execution has ended (docs). A split second after displaying that prompt the [Airbrake] Success: Net::HTTPOK message is outputted, so that's why the quit command is not prefixed by the prompt.

Trying to confirm my theory the process is ended too fast, I reconfigured Airbrake as such:

Airbrake.configure do |config|
    # ...
    config.async = lambda { |notice| SendJob.perform_async(notice); sleep 1 }
  end

and it works as expected.

In https://github.com/brandonhilkert/sucker_punch/issues/188#issue-191922704 it is suggested to run jobs inline from rake tasks (which works too for my case) but I was under a similar impression SP would wait a little (shutdown_timeout seconds) while before quitting?

brandonhilkert commented 7 years ago

hey @bdewater - sorry for the lack of response. Been down for the count with the stomach flu. I plan to take a look today. Will be in touch 👍

brandonhilkert commented 7 years ago

@bdewater setting

workers 1

in the SendJob should fix it up. Let me know.

@jdantonio Wondering if you could chime in...

Sucker Punch Queues default to 2 workers. I think most people leave it as is, but anything over 1 would be susceptible to an issue attempting to empty the queues upon shutdown.

When the CountDownLatch is posted to each queue, it's pushed to the end of the queue. This will only properly wait to all other jobs before it have completed if there is 1 worker for each queue. Let's say there are 2 workers in a queue and 1 job is currently running, but yet to complete (let's say there's a 10 sec. sleep in it), the CountDownLatch is posted to the queue and picked up by the idle worker (2nd worker) and executed immediately.

I think if a queue has 100 jobs in it and 2 workers running, it would try to exhaust all of them (assuming they execute within the shutdown_timeout) and the worst case is the last job will be cut off.

I'm wondering if there's a better strategy to make sure the CountDownLatch job isn't executed until ALL other jobs have finished, even run by other workers. First thought was to adjust the number of workers on the fly to 1 and then post the job, but I'm unsure what would happen to any active workers when that call is made.

The following code exits immediately:

require 'sucker_punch'

class FakeJob
  include SuckerPunch::Job

  def perform
    sleep 2
    puts "inside FakeJob"
  end
end

FakeJob.perform_async

[02:13:48] bhilkert [~/Desktop] $ ruby sp.rb

The following code will properly wait b/c the latch is run AFTER the enqueue job:

require 'sucker_punch'

class FakeJob
  include SuckerPunch::Job
  workers 1

  def perform
    sleep 2
    puts "inside FakeJob"
  end
end

FakeJob.perform_async

[02:13:58] bhilkert [~/Desktop] $ ruby sp.rb
inside FakeJob

brandonhilkert commented 7 years ago

@bdewater Can you try the alt-shutdown branch and let me know if that works as expected?

@jdantonio / @pitr-ch - Do you think this is a reasonable approach? https://github.com/brandonhilkert/sucker_punch/commit/70aa7bd0cbc7e57aa328fdfb33801339f1cb27e1

jdantonio commented 7 years ago

@brandonhilkert I apologize for the late response, I've been out of town for work.

This will only properly wait to all other jobs before it have completed if there is 1 worker for each queue.

As soon as I saw this I realized you were correct. This is a legit bug. :-( I haven't had a chance to look at the suggested fix yet, but your description sounds solid. I'll look at it more deeply once I get home.

brandonhilkert commented 7 years ago

Fixed in 2.0.3

bdewater commented 7 years ago

Thanks everybody, and sorry for my own delay in responding (vacation). I've tried 2.0.3 in the Rails app and it works! 🙇

@jdantonio do you mean a bug in concurrent-ruby and if so, do you want me to open an issue?

jdantonio commented 7 years ago

@bdewater No, a bug here. I implemented the shutdown code in sucker_punch when @brandonhilkert moved to c-r. My algorithm here was fundamentally wrong, as you figured out. 😊

brandonhilkert / sucker_punch

Ruby process quits before Sucker Punch finishes #203