amazon-archives / aws-flow-ruby

ARCHIVED
137 stars 58 forks source link

Question on illegal transition errors #19

Closed arkadiyt closed 10 years ago

arkadiyt commented 10 years ago

We've recently started getting the following error:

"This is not a legal transition, attempting to consume handle_initiated_event while at state created"

/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/state_machines.rb:72:in `consume'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/async_decider.rb:378:in `handle_activity_task_scheduled'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/async_decider.rb:654:in `process_event'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/async_decider.rb:252:in `block in decide_impl'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/async_decider.rb:248:in `each'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/async_decider.rb:248:in `decide_impl'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/async_decider.rb:226:in `decide'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/task_handler.rb:47:in `handle_decision_task'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/task_poller.rb:65:in `poll_and_process_single_task'
/mnt/rails/shared/bundle/ruby/2.0.0/gems/aws-flow-1.0.5/lib/aws/decider/worker.rb:199:in `run_once'

I've been trying to dig down and figure out what sequence of history events causes the decider to get into this state but can't reproduce it - I was wondering if you might have any ideas from the stack trace and error message?

I thought we might get an error like this if we started to have the decider depend on external state and perform different actions throughout the replays, but our decider always performs the exact same steps given a particular input.

mjsteger commented 10 years ago

That error indicates that our internal state machines have gotten into some illegal state - in this case, that the activity state machine was in the created state, and attempted to consume the symbol handle_initiated_event, which it not legal at that point. A quick glance through the code doesn't point to any obvious reasons this could be getting raised, but I'd be very interested in a repro. One thing that does come to mind, is it possible that you are scheduling more than ~100 activities at a time? Because we only schedule 99 activities at a time, the code has some logic wherein it attempts to only transition ~100 state machines from the created state to the decision_sent state(at which handle_initiated_event would be a valid transition). If there was some bug in that logic, it likely would exhibit the problem you are experiencing.

arkadiyt commented 10 years ago

We've had that problem in the past with another flow but we got around it by creating an immediate firing timer after 99 tasks (ala the single reply in this thread: https://forums.aws.amazon.com/thread.jspa?messageID=359696#359696). For this particular flow however we're only scheduling 25 tasks per (child) workflow. I'll keep digging into it and see if I can come up with a reproducible test case

arkadiyt commented 10 years ago

As part of tracking this down I found it helpful to write myself a small utility that would generate visualizations of all the state machine transitions. This is unrelated to the ticket, but after doing so I noticed that you had an unreachable "started" state that seems to serve no purpose:

http://i.imgur.com/n59isjm.png

arkadiyt commented 10 years ago

So the current workflow is this: Schedule 25 async tasks, wait_for_all on them to complete, then schedule 1 synchronous task to merge their results and email a summary.

What I'm seeing is when the decider goes into decide_impl, single_decision_event will just have the WorkflowExecutionStarted event. When it calls event_loop here, that will add all the state machines for the scheduled tasks to the decision map by eventually invoking this code for Activity1 through Activity25.

The next time that decide_impl gets called, single_decision_event will be an array of a bunch of ActivityTaskScheduled's, followed by a bunch of ActivityTaskStarted's, followed by some ActivityTaskCompleted's. Since the decision_map has been populated with state machines for all of these activity tasks, the handle_decision_task_started call here will move all of the state machines from created to decision_sent.

Now if I continue stepping through then eventually I reach a point where single_decision_event will contain the ActivityTaskScheduled event for the last synchronous task - the one for the summary. In this case, the decision map doesn't get populated with the state machine for that task until event_loop gets called on the immediately prior event (the last ActivityTaskCompleted for the async tasks). So, the task got added after the handle_decision_task_started call, and therefore its state machine never got moved from created to decision_sent. Then when it tries to processes the event it raises the illegal transition error.

I tried to dig more into why event_loop added the state machines to the decision map for Activity1 through Activity25 before they were actually processed, but not for Activity26, but the gem code is very meta and I didn't get very far.

Does that explanation make sense? Do you see any red flags?

Any insight would be greatly appreciated!

Thanks

mjsteger commented 10 years ago

First off, that visualization is really cool. Do you think it would be possible to share the code you used to generate it?

There is a small problem with the visualization - it looks like it represents the DecisionStateMachineBase, which is why start appears to be an orphaned state(it's admittedly not great that it isn't connected in the base though, and that's a great catch!). Start appears to be used at least in the ChildWorkflowStateMachine.

Also, you shouldn't have to dictate the scheduling of activities yourself - Flow is supposed to abstract away scheduling even batches of activities such that end users don't have to manually code scheduling N activities to get around limits on SWF's side.

The explanation makes sense, though I don't follow on this last part : "So, the task got added after the handle_decision_task_started call, and therefore its state machine never got moved from created to decision_sent. Then when it tries to processes the event it raises the illegal transition error."

Based on my understanding, after eventLoop it should go back here as the scheduled should be a different "single_decision_event" and thus we should have done handle_decision_task_started_event in between. If it's in the same single_decision_event(haven't had a chance to test with the example you gave yet), then I'm pretty sure that's a bug.

So, just to clarify, event_loop is "run the users workflow code making as much forward progress as possible", and process_event is "process this particular event, possibly setting futures that the workflow code is using so as to enable more forward progress".

The reason event_loop added the state machines for 1-25 in the first step is because they are getting scheduled in parallel. Until you get to past the wait_for_all, you haven't yet evaluated the code that will produce Activity26, so there can't be a state machine representation for it.

Let me know if any of the above is in any way unclear.

Lastly, it's exciting to see someone be able to grok the code so quickly.

arkadiyt commented 10 years ago

Absolutely - it's just a quick and dirty script, with graphviz doing all of the heavy lifting: https://gist.github.com/arkadiyt/7364986

Yes, what I meant with that last part was that it was in the same single_decision_event, which it shouldn't be. However I can't make a minimal test case to reproduce it locally (though it is happening every day on our production environment).

Thanks for the explanations - that helps a lot!

arkadiyt commented 10 years ago

Not sure if this is the exact case we're hitting on production but I can reliably reproduce it locally using the following:

futures = []
futures << activity_client.send_async(:task1, 1)
futures << activity_client.send_async(:task1, 1.5)
wait_for_all(futures)
activity_client.task1(5)

The only thing that task1 does is sleep for however long its argument is.

After the first call to task1 completes, a decision task is started - I kill the decider here, causing a time out. Then I start the decider again and when it processes the events it runs into that illegal state transition error. It seemed significant in my testing that an activity task completed (in this case, the second call to task1) while a decision task was in the middle of executing.

I've attached the history of the execution (sorry it's an image) - I killed the decider at event id 13, causing the timeout at event id 14. Then I started the decider again and it continued processing until event id 21 (where the crash happens), and then I terminated the execution.

flows

Do you possibly see something in the gem code that might cause this behavior?

mjsteger commented 10 years ago

I've got a solid reproduction on the problem, and will be digging in more. Thanks for the clear/concise bug report!

arkadiyt commented 10 years ago

Hey mjsteger, was just wondering if there were any updates on this issue?

mjsteger commented 10 years ago

Sorry for the lack of update, we're still working on a fix.

mjsteger commented 10 years ago

Should be fixed as of a2da5c69fba7e6ac4b006f479a0bd17d7210b635(in particular, https://github.com/aws/aws-flow-ruby/blob/master/aws-flow/test/aws/decider_spec.rb#L1417 is our reproduction).

arkadiyt commented 10 years ago

Awesome, thanks!