Open ThomasBleijendaal opened 2 years ago
Adding @jviau to this discussion. Handling activity and sub-orchestration retries correctly is one potential gap we recently identified in the rewind implementation. How to handle timers is another area where we know we need to do more thinking. It sounds like you've identified a specific bug and a potential fix, which is really helpful.
Just to make your example more concrete, can you share an example of what the orchestration looks like that reproduces this error? For example, is it simply an orchestration that calls one activity, retries once, and then fails?
The orchestration is quite simple, and it boils down to:
    OrchestrationContext.CallActivityWithRetryAsync<Response>(
        "activity",
        new RetryOptions(DelayOrDefault(delay), maxAttempts)
        {
            BackoffCoefficient = 2
        },
        new Request());
The "activity" function is just throwing an InvalidOperationException. After the initial run, the orchestration failed because "activity" always throws. The history table has 3 sets of TaskScheduled + TaskFailed (and 3 sets of TimerCreated and TimerFired).
I modify "activity" to not throw and then send a Rewind request. The TaskScheduled and TaskFailed events are reset to GenericEvent. "activity" is triggered correctly. Only a new TaskScheduled event is added to the history. The orchestrator resumes and then finds the TimerCreated it does not expect, and completes with a Failed state, even before the "activity" function completes.
After facing the same issue described here and some other issues as well, I've implemented a more advanced rewind algorithm in EFCoreOrchestrationServiceClient.cs. It's working great for me so far.
First, we should consider that rewind is not only used on failed orchestrations; you could also use rewind to re-open a completed or terminated orchestration. Imagine, for example, that you fix your workflow by adding extra steps at the end and then rewind the execution to reopen it. This is how I find the optimal rewind point:
Once a rewind point is found, all events after that point must be rewound (converted into GenericEvent), with no exceptions. Otherwise, there is a risk of non-determinism errors.
Every TaskScheduled and TimerCreated event that is kept but had its corresponding completion event rewound must be rescheduled. To be able to do that, you need to store activity and timer inputs in your history table as well.
Every SubOrchestrationInstanceCreated event that is kept but had its corresponding completion event rewound must have the suborchestration rewound as well, using the same logic described above to find the optimal rewind point. Failed suborchestrations will be rewound to the last failed activity/suborchestration as well, while non-failed ones will just be reopened and will fire the orchestration completion message again.
This function implements the rewind and identifies all messages that must be scheduled and suborchestrations that must be rewound.
And finally, if the current orchestration has a parent, the parent must be rewound as well, but instead of using the rewind point logic from above, it should be rewound exactly to the "SubOrchestrationInstanceCompleted" related to the current orchestration.
The algorithm described above is general enough to rewind to any point in the history, so you could also expose a new API that lets users rewind orchestrations to the point they want.
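The steps above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the code in EFCoreOrchestrationServiceClient.cs: `Row`, `SourceId`, `RewindPlan` and `RewindSketch` are hypothetical names, the history is a plain in-memory list, and the rewind point is passed in by the caller instead of being computed.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified history row (not a DurableTask.Core type).
// SourceId links a completion event to the EventId of the event that scheduled it.
public record Row(int Seq, string EventType, int EventId, int? SourceId = null);

public record RewindPlan(
    List<Row> History,
    List<Row> ToReschedule,
    List<Row> SubOrchestrationsToRewind);

public static class RewindSketch
{
    // rewindSeq is chosen by the caller; finding the optimal point is out of scope here.
    public static RewindPlan Rewind(IReadOnlyList<Row> history, int rewindSeq)
    {
        // Every event after the rewind point becomes a GenericEvent, with no exceptions.
        var rewritten = history
            .Select(r => r.Seq > rewindSeq ? r with { EventType = "GenericEvent" } : r)
            .ToList();

        // Collect the scheduling ids whose completion event was just rewound.
        var orphaned = new HashSet<int>();
        foreach (var r in history)
        {
            if (r.Seq > rewindSeq && r.SourceId is int id) orphaned.Add(id);
        }

        // Scheduling events that are kept but lost their completion event.
        var kept = history
            .Where(r => r.Seq <= rewindSeq && r.SourceId is null && orphaned.Contains(r.EventId))
            .ToList();

        return new RewindPlan(
            rewritten,
            kept.Where(r => r.EventType is "TaskScheduled" or "TimerCreated").ToList(),
            kept.Where(r => r.EventType == "SubOrchestrationInstanceCreated").ToList());
    }
}
```

In this model, `ToReschedule` is the set of activity and timer messages to send again, and `SubOrchestrationsToRewind` is the set of child instances that need the same treatment. Rewinding the parent is left out of the sketch.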
Any movement here @cgillum? Rewind is pretty useless if we can't use resilient retry policies alongside it. Conversely, rewind functionality is pretty awesome if we can use it alongside resilient retry policies.
No updates. This item unfortunately hasn't made it high enough in the team's backlog.
Anything I/we can do to help with that? This is a pretty major feature for our team.
Adding @lilyjma, who's helping manage our backlog.
We accept pull requests. However, one of our goals for improving this feature is to rewrite it so that it's simpler and works for all backend types (Azure Storage, Netherite, MSSQL, etc.). The current implementation only works for Azure Storage. There is a brief proposal here if you're interested in taking a look and potentially contributing: https://github.com/Azure/durabletask/issues/731.
Definitely interested, thanks for the pointers. Will be taking a look.
When calling an activity with retry (via ScheduleTask via CallActivityWithRetryAsync, from the durable task extension), the RetryInterceptor retries the activities if they fail. For each retry, TimerCreated (and consequent TimerFired) events are added to the history of the orchestration. (And due to the following: after the last attempt, another TimerEvent and TimerFired event are added.)
When AzureTableTrackingStore.RewindHistoryAsync is called to rewind the orchestration, only TaskFailed and SubOrchestrationInstanceFailed (and their corresponding TaskScheduled and SubOrchestrationInstanceCreated) get their EventType reset to GenericEvent. So when the orchestration restarts, it encounters TimerCreated and TimerFired events that it did not expect, and causes the following error:
I think to fix this, the rewind algorithm should take the timer events into account, and also overwrite their EventType to GenericEvent. I've tested this by modifying the table storage entries before rewinding and that works. I can imagine that the fix is to find all the TimerCreated events that have an EventId higher than the TaskScheduled that is being reset. The corresponding TimerFired events can be found using the TimerId property.
I don't mind implementing the fix for this, but I would like to know if this is the best approach. I can imagine that this change can inadvertently reset some timers it should not touch. But as the Rewind algorithm just resets all the TaskFailed events, resetting the timer events after those events might just work fine.
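As a rough illustration of the selection rule proposed above, the sketch below works on a simplified in-memory history. It is not the AzureTableTrackingStore code: `Ev`, `TimerAwareRewind` and `ResetWithTimers` are hypothetical names, and taking the lowest failed TaskScheduled EventId as the cut-off is one possible reading of "higher than the TaskScheduled that is being reset".

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified history row (not a DurableTask.Core type).
public record Ev(string EventType, int EventId, int? TaskScheduledId = null, int? TimerId = null);

public static class TimerAwareRewind
{
    public static List<Ev> ResetWithTimers(IReadOnlyList<Ev> history)
    {
        // EventIds of the TaskScheduled events that belong to a TaskFailed event.
        var failedIds = new HashSet<int>();
        foreach (var e in history)
        {
            if (e.EventType == "TaskFailed" && e.TaskScheduledId is int id) failedIds.Add(id);
        }
        if (failedIds.Count == 0) return history.ToList();

        // Assumption: timers created after the earliest failed task are retry timers.
        var lowest = failedIds.Min();
        var timerIds = history
            .Where(e => e.EventType == "TimerCreated" && e.EventId > lowest)
            .Select(e => e.EventId)
            .ToHashSet();

        bool ShouldReset(Ev e) => e.EventType switch
        {
            "TaskScheduled" => failedIds.Contains(e.EventId),
            "TaskFailed" => true,
            "TimerCreated" => timerIds.Contains(e.EventId),
            // TimerFired is matched to its TimerCreated through TimerId.
            "TimerFired" => e.TimerId is int t && timerIds.Contains(t),
            _ => false,
        };

        return history
            .Select(e => ShouldReset(e) ? e with { EventType = "GenericEvent" } : e)
            .ToList();
    }
}
```

A timer that the orchestrator created before the first failed activity keeps its type in this sketch. A user timer created after that point would be reset too, which is the risk mentioned above.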