amazon-archives / aws-flow-ruby

ARCHIVED
136 stars 57 forks source link

rescue StandardError instead of Exception & support ruby 2.0 #29

Open brightball opened 10 years ago

brightball commented 10 years ago

I'm trying to use the Flow framework with Foreman, which just sends a SIGINT to the running process.

I'd like to be able to do something like this:

if __FILE__ == $0
  begin
    worker.start
  rescue SystemExit, Interrupt
    worker.stop # Or some appropriate message that will actually shut things down
    raise
  end
end

Right now, whenever I press CTRL+C with Foreman it's sending a SIGINT which results in this error:

─$ foreman start
11:58:56 decider.1 | started with pid 17132
^CSIGINT received
/usr/local/foreman/lib/foreman/engine.rb:226:in `synchronize': can't be called from trap context (ThreadError)
    from /usr/local/foreman/lib/foreman/engine.rb:226:in `output_with_mutex'
    from /usr/local/foreman/lib/foreman/engine.rb:232:in `system'
    from /usr/local/foreman/lib/foreman/engine.rb:304:in `terminate_gracefully'
    from /usr/local/foreman/lib/foreman/engine.rb:41:in `block in start'
    from /usr/local/foreman/lib/foreman/engine.rb:289:in `call'
    from /usr/local/foreman/lib/foreman/engine.rb:289:in `wait2'
    from /usr/local/foreman/lib/foreman/engine.rb:289:in `watch_for_termination'
    from /usr/local/foreman/lib/foreman/engine.rb:48:in `start'
    from /usr/local/foreman/lib/foreman/cli.rb:40:in `start'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor/task.rb:27:in `run'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor/invocation.rb:120:in `invoke_task'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor.rb:275:in `dispatch'
    from /usr/local/foreman/vendor/gems/thor-0.16.0/lib/thor/base.rb:425:in `start'
    from /usr/bin/foreman:15:in `<main>'
mjsteger commented 10 years ago

At first glance, looks like an issue with using trap context in a way that is deadlockable(see this). In theory flow workers are supposed to have a graceful shutdown mechanic, but I can see how it would be problematic in 2.0 if they have deadlockable detection on traps. Can you confirm that you are getting this error while running ruby >= 2.0 ?

brightball commented 10 years ago

It is running 2.1.0 actually.

mjsteger commented 10 years ago

Still working on this, sorry for the slow progress.

marcbowes commented 10 years ago

FWIW, the flow framework does rescue Exception everywhere which blocks signals regardless of the deadlock issue you describe. You should consider replacing all of these with less aggressive rescues.

https://github.com/aws/aws-flow-ruby/blob/master/aws-flow/lib/aws/decider/task_poller.rb#L259 is a big culprit.

mjsteger commented 10 years ago

This is something we flagged pretty early on, but we unfortunately can't change the behavior without doing a major version bump. We definitely are planning to replace all the rescue Exception with something less aggressive like rescue StandardError. This may take a bit of time, as we will want to package together reverse-incompatible changes. I'll get together a list of the changes we are planning on making that may be reverse-incompatible and add them to the repository.

justincampbell commented 10 years ago

IMO, rescue Exception is a bug and not a backwards-incompatible change. There should not be code depending on this behavior.

kinsersh commented 10 years ago

Any ETA on this being resolved?

pmohan6 commented 10 years ago

Changing the title to better reflect the current issue.

1) The original issue was that ruby 2.0 doesn't allow any locking (rightly so) in trap contexts. While we don't explicitly acquire any locks, we do initiate shutdown of workers when we receive signals. This forces Flow to log some stuff. Logger internally tries to lock on a mutex and it fails.

2) Workers do indeed shutdown when they receive signals just not in the most graceful way. i.e. they don't handle the long polling behavior of SWF well. Refer to #31 to get more context.

3) We need to rescue StandardError instead of Exception.

@kinsersh can you tell me what issue you were referring to?

kinsersh commented 10 years ago

I was referring to number 3 in your list.