amazon-archives / aws-flow-ruby

ARCHIVED
137 stars 58 forks source link

NoMethodError error in workflow poller #62

Closed Tsquare closed 9 years ago

Tsquare commented 10 years ago

I'm seeing intermittent errors like this in the AWS-Flow-WorkflowTaskPoller log file: I, [2014-07-08T21:43:39.541507 #6052] INFO -- : Got decision task ... E, [2014-07-08T21:43:44.207649 #6052] ERROR -- : Error in the poller, NoMethodError, undefined method `consume' for nil:NilClass

For the same flow with the same parameters, etc, we sometimes get this error with the result that decision tasks time out and the flow never completes, and sometimes everything completes normally.

pmohan6 commented 10 years ago

We definitely need to improve logging here. This error usually indicates that the state machine that was supposed to consume a symbol doesn't exist anymore.

Did you see this with the latest version of ruby flow?

We have seen this in a few cases - 1) Do you have a multi threaded model by any chance? Ruby flow is not tested to be thread safe. 2) Do you use timers? We recently fixed a bug in our Timer state machine that was causing this error. 3) Related to 2, do you schedule more than 100 decisions at once? In such a case, ruby flow will break the decisions in chunks of 99 + 1 (timer) and schedule them separately.

Can you share with us a small repro if you notice this with the latest version of ruby flow?

Tsquare commented 10 years ago

I haven't been able to try with the 2.0.0 gem version because we're on Ruby 2.0, but as soon as your fix is in I'll give it a shot. I can't get it to reproduce consistently, unfortunately. To your questions:

1) No multithreading 2) No timers in this particular flow (it does use lots of child workflows, and it looks like this error happens when a ChildWorkflowExecutionCompleted event occurs) 3) Not in this particular flow

pmohan6 commented 10 years ago

Can you give me a history of one of the workflows that timed out because of the error? I can try to run the history locally against the decider and get a repro.

Tsquare commented 10 years ago

What's the best way to get that to you? I can give you our account id & workflow id by email, for example.

pmohan6 commented 10 years ago

How large is the history? Can you just paste it here? I only need the sequence of events.

Tsquare commented 10 years ago

Pasting in. The last event that times out generates the NoMethodError in the log.

Thu Aug 21 21:52:48 GMT-700 2014 1 WorkflowExecutionStarted Thu Aug 21 21:52:48 GMT-700 2014 2 DecisionTaskScheduled Thu Aug 21 21:52:48 GMT-700 2014 3 DecisionTaskStarted Thu Aug 21 21:52:50 GMT-700 2014 4 DecisionTaskCompleted Thu Aug 21 21:52:50 GMT-700 2014 5 ActivityTaskScheduled Thu Aug 21 21:52:50 GMT-700 2014 6 ActivityTaskScheduled Thu Aug 21 21:52:50 GMT-700 2014 7 ActivityTaskScheduled Thu Aug 21 21:52:50 GMT-700 2014 8 ActivityTaskScheduled Thu Aug 21 21:52:50 GMT-700 2014 9 ActivityTaskStarted Thu Aug 21 21:52:50 GMT-700 2014 10 ActivityTaskStarted Thu Aug 21 21:52:50 GMT-700 2014 11 ActivityTaskStarted Thu Aug 21 21:52:50 GMT-700 2014 12 ActivityTaskStarted Thu Aug 21 22:00:56 GMT-700 2014 13 ActivityTaskCompleted Thu Aug 21 22:00:56 GMT-700 2014 14 DecisionTaskScheduled Thu Aug 21 22:00:57 GMT-700 2014 15 DecisionTaskStarted Thu Aug 21 22:00:58 GMT-700 2014 16 DecisionTaskCompleted Thu Aug 21 22:01:02 GMT-700 2014 17 ActivityTaskCompleted Thu Aug 21 22:01:02 GMT-700 2014 18 DecisionTaskScheduled Thu Aug 21 22:01:02 GMT-700 2014 19 DecisionTaskStarted Thu Aug 21 22:01:03 GMT-700 2014 20 DecisionTaskCompleted Thu Aug 21 22:01:24 GMT-700 2014 21 ActivityTaskCompleted Thu Aug 21 22:01:24 GMT-700 2014 22 DecisionTaskScheduled Thu Aug 21 22:01:24 GMT-700 2014 23 DecisionTaskStarted Thu Aug 21 22:01:25 GMT-700 2014 24 DecisionTaskCompleted Thu Aug 21 22:01:31 GMT-700 2014 25 ActivityTaskCompleted Thu Aug 21 22:01:31 GMT-700 2014 26 DecisionTaskScheduled Thu Aug 21 22:01:31 GMT-700 2014 27 DecisionTaskStarted Thu Aug 21 22:01:32 GMT-700 2014 28 DecisionTaskCompleted Thu Aug 21 22:01:32 GMT-700 2014 29 ActivityTaskScheduled Thu Aug 21 22:01:32 GMT-700 2014 30 ActivityTaskStarted Thu Aug 21 22:01:35 GMT-700 2014 31 ActivityTaskCompleted Thu Aug 21 22:01:35 GMT-700 2014 32 DecisionTaskScheduled Thu Aug 21 22:01:35 GMT-700 2014 33 DecisionTaskStarted Thu Aug 21 22:01:37 GMT-700 2014 34 DecisionTaskCompleted Thu Aug 21 22:01:37 GMT-700 2014 35 ActivityTaskScheduled Thu Aug 21 22:01:37 GMT-700 2014 36 ActivityTaskStarted Thu Aug 21 22:01:40 GMT-700 2014 37 ActivityTaskCompleted Thu Aug 21 22:01:40 GMT-700 2014 38 DecisionTaskScheduled Thu Aug 21 22:01:40 GMT-700 2014 39 DecisionTaskStarted Thu Aug 21 22:01:42 GMT-700 2014 40 DecisionTaskCompleted Thu Aug 21 22:01:42 GMT-700 2014 41 ActivityTaskScheduled Thu Aug 21 22:01:43 GMT-700 2014 42 ActivityTaskStarted Thu Aug 21 22:01:45 GMT-700 2014 43 ActivityTaskCompleted Thu Aug 21 22:01:45 GMT-700 2014 44 DecisionTaskScheduled Thu Aug 21 22:01:45 GMT-700 2014 45 DecisionTaskStarted Thu Aug 21 22:01:48 GMT-700 2014 46 DecisionTaskCompleted Thu Aug 21 22:01:48 GMT-700 2014 47 ActivityTaskScheduled Thu Aug 21 22:01:48 GMT-700 2014 48 ActivityTaskStarted Thu Aug 21 22:01:51 GMT-700 2014 49 ActivityTaskCompleted Thu Aug 21 22:01:51 GMT-700 2014 50 DecisionTaskScheduled Thu Aug 21 22:01:51 GMT-700 2014 51 DecisionTaskStarted Thu Aug 21 22:01:54 GMT-700 2014 52 DecisionTaskCompleted Thu Aug 21 22:01:54 GMT-700 2014 53 ActivityTaskScheduled Thu Aug 21 22:01:54 GMT-700 2014 54 ActivityTaskStarted Thu Aug 21 22:01:57 GMT-700 2014 55 ActivityTaskCompleted Thu Aug 21 22:01:57 GMT-700 2014 56 DecisionTaskScheduled Thu Aug 21 22:01:57 GMT-700 2014 57 DecisionTaskStarted Thu Aug 21 22:02:00 GMT-700 2014 58 DecisionTaskCompleted Thu Aug 21 22:02:00 GMT-700 2014 59 ActivityTaskScheduled Thu Aug 21 22:02:00 GMT-700 2014 60 ActivityTaskStarted Thu Aug 21 22:02:03 GMT-700 2014 61 ActivityTaskCompleted Thu Aug 21 22:02:03 GMT-700 2014 62 DecisionTaskScheduled Thu Aug 21 22:02:03 GMT-700 2014 63 DecisionTaskStarted Thu Aug 21 22:02:08 GMT-700 2014 64 DecisionTaskCompleted Thu Aug 21 22:02:08 GMT-700 2014 65 ActivityTaskScheduled Thu Aug 21 22:02:08 GMT-700 2014 66 ActivityTaskScheduled Thu Aug 21 22:02:08 GMT-700 2014 67 ActivityTaskScheduled Thu Aug 21 22:02:08 GMT-700 2014 68 ActivityTaskScheduled Thu Aug 21 22:02:08 GMT-700 2014 69 ActivityTaskScheduled Thu Aug 21 22:02:08 GMT-700 2014 70 ActivityTaskScheduled Thu Aug 21 22:02:08 GMT-700 2014 71 ActivityTaskStarted Thu Aug 21 22:02:08 GMT-700 2014 72 ActivityTaskStarted Thu Aug 21 22:02:08 GMT-700 2014 73 ActivityTaskStarted Thu Aug 21 22:02:08 GMT-700 2014 74 ActivityTaskStarted Thu Aug 21 22:02:08 GMT-700 2014 75 ActivityTaskStarted Thu Aug 21 22:02:08 GMT-700 2014 76 ActivityTaskStarted Thu Aug 21 22:02:13 GMT-700 2014 77 ActivityTaskCompleted Thu Aug 21 22:02:13 GMT-700 2014 78 DecisionTaskScheduled Thu Aug 21 22:02:13 GMT-700 2014 79 DecisionTaskStarted Thu Aug 21 22:02:18 GMT-700 2014 80 DecisionTaskCompleted Thu Aug 21 22:02:44 GMT-700 2014 81 ActivityTaskCompleted Thu Aug 21 22:02:44 GMT-700 2014 82 DecisionTaskScheduled Thu Aug 21 22:02:44 GMT-700 2014 83 DecisionTaskStarted Thu Aug 21 22:02:49 GMT-700 2014 84 DecisionTaskCompleted Thu Aug 21 22:03:06 GMT-700 2014 85 ActivityTaskCompleted Thu Aug 21 22:03:06 GMT-700 2014 86 DecisionTaskScheduled Thu Aug 21 22:03:06 GMT-700 2014 87 DecisionTaskStarted Thu Aug 21 22:03:07 GMT-700 2014 88 ActivityTaskCompleted Thu Aug 21 22:03:07 GMT-700 2014 89 DecisionTaskScheduled Thu Aug 21 22:03:11 GMT-700 2014 90 DecisionTaskCompleted Thu Aug 21 22:03:11 GMT-700 2014 91 DecisionTaskStarted Thu Aug 21 22:03:16 GMT-700 2014 92 DecisionTaskCompleted Thu Aug 21 22:03:32 GMT-700 2014 93 ActivityTaskCompleted Thu Aug 21 22:03:32 GMT-700 2014 94 DecisionTaskScheduled Thu Aug 21 22:03:32 GMT-700 2014 95 DecisionTaskStarted Thu Aug 21 22:03:37 GMT-700 2014 96 DecisionTaskCompleted Thu Aug 21 22:05:53 GMT-700 2014 97 ActivityTaskCompleted Thu Aug 21 22:05:53 GMT-700 2014 98 DecisionTaskScheduled Thu Aug 21 22:05:53 GMT-700 2014 99 DecisionTaskStarted Thu Aug 21 22:05:59 GMT-700 2014 100 DecisionTaskCompleted Thu Aug 21 22:05:59 GMT-700 2014 101 ActivityTaskScheduled Thu Aug 21 22:05:59 GMT-700 2014 102 ActivityTaskStarted Thu Aug 21 22:08:17 GMT-700 2014 103 ActivityTaskCompleted Thu Aug 21 22:08:17 GMT-700 2014 104 DecisionTaskScheduled Thu Aug 21 22:08:17 GMT-700 2014 105 DecisionTaskStarted Thu Aug 21 22:08:23 GMT-700 2014 106 DecisionTaskCompleted Thu Aug 21 22:08:23 GMT-700 2014 107 ActivityTaskScheduled Thu Aug 21 22:08:23 GMT-700 2014 108 ActivityTaskStarted Thu Aug 21 23:04:13 GMT-700 2014 109 ActivityTaskCompleted Thu Aug 21 23:04:13 GMT-700 2014 110 DecisionTaskScheduled Thu Aug 21 23:04:13 GMT-700 2014 111 DecisionTaskStarted Thu Aug 21 23:04:21 GMT-700 2014 112 DecisionTaskCompleted Thu Aug 21 23:04:21 GMT-700 2014 113 StartChildWorkflowExecutionInitiated Thu Aug 21 23:04:21 GMT-700 2014 114 StartChildWorkflowExecutionInitiated Thu Aug 21 23:04:21 GMT-700 2014 115 StartChildWorkflowExecutionInitiated Thu Aug 21 23:04:21 GMT-700 2014 116 ChildWorkflowExecutionStarted Thu Aug 21 23:04:21 GMT-700 2014 117 DecisionTaskScheduled Thu Aug 21 23:04:21 GMT-700 2014 118 ChildWorkflowExecutionStarted Thu Aug 21 23:04:21 GMT-700 2014 119 DecisionTaskStarted Thu Aug 21 23:04:21 GMT-700 2014 120 ChildWorkflowExecutionStarted Thu Aug 21 23:04:21 GMT-700 2014 121 DecisionTaskScheduled Thu Aug 21 23:04:28 GMT-700 2014 122 DecisionTaskCompleted Thu Aug 21 23:04:28 GMT-700 2014 123 DecisionTaskStarted Thu Aug 21 23:04:29 GMT-700 2014 124 ChildWorkflowExecutionCompleted Thu Aug 21 23:04:29 GMT-700 2014 125 DecisionTaskScheduled Thu Aug 21 23:04:35 GMT-700 2014 126 DecisionTaskCompleted Thu Aug 21 23:04:35 GMT-700 2014 127 DecisionTaskStarted Thu Aug 21 23:04:41 GMT-700 2014 128 DecisionTaskCompleted Thu Aug 21 23:05:04 GMT-700 2014 129 ChildWorkflowExecutionCompleted Thu Aug 21 23:05:04 GMT-700 2014 130 DecisionTaskScheduled Thu Aug 21 23:05:04 GMT-700 2014 131 DecisionTaskStarted Thu Aug 21 23:05:11 GMT-700 2014 132 DecisionTaskCompleted Thu Aug 21 23:05:48 GMT-700 2014 133 ChildWorkflowExecutionCompleted Thu Aug 21 23:05:48 GMT-700 2014 134 DecisionTaskScheduled Thu Aug 21 23:05:48 GMT-700 2014 135 DecisionTaskStarted Thu Aug 21 23:05:55 GMT-700 2014 136 DecisionTaskCompleted Thu Aug 21 23:05:55 GMT-700 2014 137 StartChildWorkflowExecutionInitiated Thu Aug 21 23:05:55 GMT-700 2014 138 StartChildWorkflowExecutionInitiated Thu Aug 21 23:05:55 GMT-700 2014 139 ChildWorkflowExecutionStarted Thu Aug 21 23:05:55 GMT-700 2014 140 DecisionTaskScheduled Thu Aug 21 23:05:55 GMT-700 2014 141 ChildWorkflowExecutionStarted Thu Aug 21 23:05:55 GMT-700 2014 142 DecisionTaskStarted Thu Aug 21 23:06:02 GMT-700 2014 143 DecisionTaskCompleted Thu Aug 21 23:13:56 GMT-700 2014 144 ChildWorkflowExecutionCompleted Thu Aug 21 23:13:56 GMT-700 2014 145 DecisionTaskScheduled Thu Aug 21 23:13:56 GMT-700 2014 146 DecisionTaskStarted Thu Aug 21 23:14:05 GMT-700 2014 147 DecisionTaskCompleted Thu Aug 21 23:37:56 GMT-700 2014 148 ChildWorkflowExecutionCompleted Thu Aug 21 23:37:56 GMT-700 2014 149 DecisionTaskScheduled Thu Aug 21 23:37:56 GMT-700 2014 150 DecisionTaskStarted Thu Aug 21 23:38:04 GMT-700 2014 151 DecisionTaskCompleted Thu Aug 21 23:38:04 GMT-700 2014 152 StartChildWorkflowExecutionInitiated Thu Aug 21 23:38:04 GMT-700 2014 153 StartChildWorkflowExecutionInitiated Thu Aug 21 23:38:04 GMT-700 2014 154 StartChildWorkflowExecutionInitiated Thu Aug 21 23:38:04 GMT-700 2014 155 ChildWorkflowExecutionStarted Thu Aug 21 23:38:04 GMT-700 2014 156 DecisionTaskScheduled Thu Aug 21 23:38:04 GMT-700 2014 157 ChildWorkflowExecutionStarted Thu Aug 21 23:38:04 GMT-700 2014 158 ChildWorkflowExecutionStarted Thu Aug 21 23:38:04 GMT-700 2014 159 DecisionTaskStarted Thu Aug 21 23:38:14 GMT-700 2014 160 DecisionTaskCompleted Thu Aug 21 23:41:04 GMT-700 2014 161 ChildWorkflowExecutionCompleted Thu Aug 21 23:41:04 GMT-700 2014 162 DecisionTaskScheduled Thu Aug 21 23:41:04 GMT-700 2014 163 DecisionTaskStarted Thu Aug 21 23:41:13 GMT-700 2014 164 DecisionTaskCompleted Thu Aug 21 23:41:16 GMT-700 2014 165 ChildWorkflowExecutionCompleted Thu Aug 21 23:41:16 GMT-700 2014 166 DecisionTaskScheduled Thu Aug 21 23:41:16 GMT-700 2014 167 DecisionTaskStarted Thu Aug 21 23:41:23 GMT-700 2014 168 DecisionTaskCompleted Fri Aug 22 01:00:48 GMT-700 2014 169 ChildWorkflowExecutionCompleted Fri Aug 22 01:00:48 GMT-700 2014 170 DecisionTaskScheduled Fri Aug 22 01:00:49 GMT-700 2014 171 DecisionTaskStarted Fri Aug 22 01:10:49 GMT-700 2014 172 DecisionTaskTimedOut

pmohan6 commented 10 years ago

actually, it might be a better idea to send the workflow details to me. Can you give me the account id, swf endpoint, domain, workflow id and run id via AWS forums? You can send me a private message using this link - https://forums.aws.amazon.com/pmpost!default.jspa. My forum handle is paritoshataws Also, let me know if you encountered this error in aws-flow-2.0.1.

Tsquare commented 10 years ago

When I try, I get:

Error sending message - the recipient is not allowed to receive private messages.

Tsquare commented 10 years ago

Just saw the issue in aws-flow-2.0.1, sorry to say

pmohan6 commented 10 years ago

Just wanted to let you know that I'm working on resolving the forum issue so that you can send me the workflow details.

pmohan6 commented 10 years ago

Can you try sending the details in a private message to 'asadaws' on this link - https://forums.aws.amazon.com/pmpost!default.jspa

Tsquare commented 10 years ago

same error

pmohan6 commented 10 years ago

Sorry about the delay. Can you post a new thread on the AWS SimpleWorkflow forums here - https://forums.aws.amazon.com/forum.jspa?forumID=133. That way our cloud support can look up your account. You can just provide a link on the forum post to this github page.

Tsquare commented 10 years ago

Done: https://forums.aws.amazon.com/thread.jspa?messageID=568863&#568863

pmohan6 commented 9 years ago

I haven't been able to repro using the workflow code and history that you sent me. Will it be possible for you to write a minimal repro? Btw, do any of your child workflows/activities return more than 32k characters of data?

Tsquare commented 9 years ago

Unfortunately I can't repro reilably. The answer to your 2nd question is no.

pmohan6 commented 9 years ago

I'm working on improving logging. Hopefully that will help us narrow down the issue.

pmohan6 commented 9 years ago

Just a random thought, but do you have non deterministic code in your workflow? Workflow code should be deterministic and should always produce the same result given a particular history. Non-determinism could cause these no method errors.

Tsquare commented 9 years ago

Don't see anything from a quick scan (nothing dependent on time/date, or anything else that could change between/while workflow runs)

pmohan6 commented 9 years ago

We recently released a WorkflowReplayer for ruby flow. The Replayer can be used to 'replay' the history of a particular workflow execution in the decider. You can start a debugger and step through it and see where the issue lies. Can you try replaying your workflow history? You can see a code snippet on how to use the Replayer here - https://github.com/aws/aws-flow-ruby/pull/72

pmohan6 commented 9 years ago

Unfortunately, we are not able to reproduce this issue ourselves. It's possible that there is some state machine bug or some weird race condition but a minimal repro is needed to isolate the issue. Were you able to use the replayer to isolate it? Do you continue to see this issue?

Tsquare commented 9 years ago

sorry, have not been able to repro with replayer. the plan is to work around this issue by changing our flow

pmohan6 commented 9 years ago

I'm resolving this issue for now because we haven't been able to repro it and don't have a path forward. Feel free to reopen if you have any new findings.