amazon-archives / aws-flow-ruby

ARCHIVED

CompleteWorkflowExecutionFailed when 2 activities fail simultaneously #25

Closed jongbeau closed 10 years ago

jongbeau commented 10 years ago

Please see the link for a small repro of the issue. When I run this test, the execution times out because the decider hits "CompleteWorkflowExecutionFailed" after the activity task fails. See the screenshot for what I mean. This doesn't happen every time, but it happens quite often (more than 50% of the time).

I've had lots of trouble handling failed activities without breaking the decider or getting it into a bad state.

http://www.fileswap.com/dl/PchKdHpSZB/

[Screenshot: 2013-12-03 at 5:59:31 PM]

jongbeau commented 10 years ago

I've also been in touch with AWS Support:

Amazon Web Services Dec 04, 2013 01:58 AM PST

Hi, I've tested your application in my environment, and the failures occur when two activities fail at the same time.

I've escalated the issue to the SWF team.

Once I hear back from the team, I will get back in touch with you.

Best regards,

Javier R. Amazon Web Services
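
To make the failure mode concrete: when two activities fail at nearly the same moment, both ActivityTaskFailed events can be delivered to the decider in a single decision task, and the decider has to turn that batch into one coherent set of decisions. The sketch below is a toy model in plain Ruby, not aws-flow's actual decider and not the fix that was shipped; the `decide` method, the event hashes, and the `pending_activities` count are all hypothetical. It only illustrates the property worth checking for: however many failures arrive together, at most one close decision (here `FailWorkflowExecution`) is returned.

```ruby
# Toy model, NOT aws-flow's implementation. `decide` is hypothetical: it takes
# the new events from one decision task plus the number of activities that were
# still outstanding before those events, and returns the decisions to send back.
def decide(new_events, pending_activities)
  failures  = new_events.select { |e| e[:type] == "ActivityTaskFailed" }
  completed = new_events.count { |e| e[:type] == "ActivityTaskCompleted" }
  remaining = pending_activities - failures.size - completed

  if failures.any?
    # Report every failure in the batch, but emit exactly one close decision.
    reason = failures.map { |e| e[:reason] }.join("; ")
    [{ type: "FailWorkflowExecution", reason: reason }]
  elsif remaining.zero?
    [{ type: "CompleteWorkflowExecution" }]
  else
    [] # Other activities are still running; nothing to decide yet.
  end
end

# Two failures delivered together still produce a single close decision.
batch = [
  { type: "ActivityTaskFailed", reason: "activity A failed" },
  { type: "ActivityTaskFailed", reason: "activity B failed" }
]
p decide(batch, 2)
```

The point of the toy is the shape of the check, not the bookkeeping: a real decider replays the full history, but the same "one batch in, at most one close decision out" rule is what a repro like the one above exercises.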

mjsteger commented 10 years ago

Looking into it now. Thanks for the clear repro!

mjsteger commented 10 years ago

I have a fix that solves the immediate problem (i.e. I was able to run the repro you gave 40 times without any timeouts); we'll run it through testing and code review and try to get it out as soon as possible.
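
Forty clean runs are fairly strong evidence, given the original report that the repro failed more than 50% of the time. Assuming each run fails independently with probability p, the chance of n consecutive passes is (1 - p)**n. A quick check in Ruby (the helper name is made up for illustration):

```ruby
# Chance that `runs` independent attempts all pass when each one fails with
# probability `failure_rate`. Independence is an assumption, not a given.
def all_pass_probability(failure_rate, runs)
  (1.0 - failure_rate)**runs
end

# The reported failure rate was "more than 50%", so 0.5 gives an upper bound
# on the chance of 40 straight passes if the bug were still present.
chance = all_pass_probability(0.5, 40)
puts format("%.2e", chance)
```

Even at a much lower failure rate of 10%, 40 straight passes would happen by chance only about 1.5% of the time (0.9**40 is roughly 0.0148). The independence assumption is the weak point: a timing-dependent bug can fail less often on a lightly loaded machine.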

jongbeau commented 10 years ago

Thank you sir! Please let me know when the fix is available for use.

jongbeau commented 10 years ago

So please forgive my ignorance, but how should I get the latest update? I'm using the aws-flow gem; do I need to build it from source, or should I update the gem?

mjsteger commented 10 years ago

If you are installing gems directly, you can issue a

gem update aws-flow

to get the latest version, which includes the changes that fixed this issue.
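
To confirm which aws-flow version is actually installed after updating, RubyGems can be queried from Ruby itself (from the shell, `gem list aws-flow` shows the same information). A minimal sketch; the helper names are hypothetical, and the thread does not say which release contained the fix, so compare against whatever version you need.

```ruby
require "rubygems"

# Installed versions of a gem, newest first; empty if the gem is not installed.
def installed_versions(gem_name)
  Gem::Specification.find_all_by_name(gem_name).map(&:version).sort.reverse
end

# Compare versions with RubyGems ordering (so "1.0.10" > "1.0.9").
def at_least?(version, minimum)
  Gem::Version.new(version) >= Gem::Version.new(minimum)
end

versions = installed_versions("aws-flow")
if versions.empty?
  puts "aws-flow is not installed"
else
  puts "newest installed aws-flow: #{versions.first}"
end
```

Using `Gem::Version` rather than plain string comparison matters once a component reaches two digits, since "1.0.10" sorts before "1.0.9" as a string.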

jongbeau commented 10 years ago

Thanks, I pulled the latest version. It seems to be fixed in my bare-bones test, but not in my real application. The main difference is that in my real application, I'm calling child workflows instead of activities. When multiple child workflows fail, I'm getting a ChildWorkflowExecutionFailed status. Here is a stack trace:
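
When several child workflows (or activities) run concurrently, more than one of them can fail in the same window, so it helps to decide up front how the parent reports that. The sketch below uses plain Ruby threads as hypothetical stand-ins for child workflows; it is not aws-flow's API, and `run_children` and the child names are made up. It waits for every branch, collects all failures, and raises a single StandardError that names each one, rather than letting whichever exception surfaces first decide the outcome.

```ruby
# Hypothetical stand-in for concurrent child workflows: each child is a
# lambda run on its own thread. This is plain Ruby, not the aws-flow API.
def run_children(children)
  threads = children.map do |name, job|
    Thread.new do
      begin
        [name, :ok, job.call]
      rescue StandardError => e
        [name, :failed, e] # capture instead of letting the thread die
      end
    end
  end

  # Wait for every child before deciding anything.
  outcomes = threads.map(&:value)
  failures = outcomes.select { |_, status, _| status == :failed }

  unless failures.empty?
    summary = failures.map { |name, _, err| "#{name}: #{err.message}" }.join("; ")
    raise StandardError, "#{failures.size} child(ren) failed (#{summary})"
  end

  outcomes.map { |name, _, value| [name, value] }.to_h
end

begin
  run_children({
    "bcp-1" => -> { raise "BCP activity failed" },
    "bcp-2" => -> { raise "BCP activity failed" },
    "ok"    => -> { :done }
  })
rescue StandardError => e
  puts e.message
end
```

Failures are reported in the order the children were listed, so the combined message is deterministic even though the threads finish in any order.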

--- !ruby/exception:StandardError
message: failing execution due to failed BCP activity
---
- ff_workflow.rb:58:in `block (2 levels) in ff_workflow'
- '------ continuation ------'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/workflow_definition.rb:50:in `block (2 levels) in execute'
- '------ continuation ------'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:393:in `block in handle_workflow_execution_started'
- '------ continuation ------'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:389:in `handle_workflow_execution_started'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:655:in `process_event'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:253:in `block in decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `each'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:227:in `decide'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_handler.rb:47:in `handle_decision_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_poller.rb:65:in `poll_and_process_single_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:199:in `run_once'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:185:in `block in start'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `loop'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `start'
- ff_workflow.rb:76:in `<main>'
- '------ continuation ------'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:389:in `new'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:389:in `handle_workflow_execution_started'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:655:in `process_event'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:253:in `block in decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `each'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:227:in `decide'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_handler.rb:47:in `handle_decision_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_poller.rb:65:in `poll_and_process_single_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:199:in `run_once'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:185:in `block in start'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `loop'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `start'
- ff_workflow.rb:76:in `<main>'
- '------ continuation ------'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:393:in `block in handle_workflow_execution_started'
- '------ continuation ------'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:389:in `handle_workflow_execution_started'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:655:in `process_event'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:253:in `block in decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `each'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:227:in `decide'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_handler.rb:47:in `handle_decision_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_poller.rb:65:in `poll_and_process_single_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:199:in `run_once'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:185:in `block in start'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `loop'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `start'
- ff_workflow.rb:76:in `<main>'
- '------ continuation ------'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:389:in `new'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:389:in `handle_workflow_execution_started'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:655:in `process_event'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:253:in `block in decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `each'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:249:in `decide_impl'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/async_decider.rb:227:in `decide'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_handler.rb:47:in `handle_decision_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/task_poller.rb:65:in `poll_and_process_single_task'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:199:in `run_once'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:185:in `block in start'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `loop'
- /home/jason/.rvm/gems/ruby-2.0.0-p247/gems/aws-flow-1.0.6/lib/aws/decider/worker.rb:184:in `start'
- ff_workflow.rb:76:in `<main>'

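
The backtrace above is a flat list in which asynchronous segments are separated by '------ continuation ------' entries, and several segments repeat. A small helper makes that structure easier to read by splitting the frames at each marker. It is a hypothetical plain-Ruby sketch, and it assumes you already have the backtrace as an array of strings (for example, the second YAML document of the dump).

```ruby
# Marker string exactly as it appears in the dumped backtrace.
CONTINUATION_MARKER = "------ continuation ------"

# Split a flat backtrace into segments at each continuation marker,
# dropping the marker entries themselves and any empty segments.
def continuation_segments(frames)
  frames.slice_before { |frame| frame == CONTINUATION_MARKER }
        .map { |segment| segment.reject { |frame| frame == CONTINUATION_MARKER } }
        .reject(&:empty?)
end

frames = [
  "ff_workflow.rb:58:in `block (2 levels) in ff_workflow'",
  CONTINUATION_MARKER,
  "workflow_definition.rb:50:in `block (2 levels) in execute'",
  CONTINUATION_MARKER,
  "ff_workflow.rb:58:in `block (2 levels) in ff_workflow'"
]
segments = continuation_segments(frames)
puts "#{segments.size} segments, #{segments.uniq.size} distinct"
```

Comparing `segments.size` with `segments.uniq.size` is a quick way to see how much of a long trace like this one is repetition.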
jongbeau commented 10 years ago

Can we reopen this issue?

mjsteger commented 10 years ago

I'm not sure exactly what the problem is, can you explain further? Is the workflow hitting ChildWorkflowExecutionFailed and then timing out in the same way that CompleteWorkflowExecutionFailed did?

mjsteger commented 10 years ago

Ping. If I can get the workflow definition and sample history, I can work on writing up a repro.

mjsteger commented 10 years ago

I'm going to close this issue to clean up, but feel free to re-open. To help you further I'll need a workflow definition and sample history.