Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.53k stars 296 forks source link

Rewind fails to rewind activities that were called with retry #811

Open ThomasBleijendaal opened 2 years ago

ThomasBleijendaal commented 2 years ago

When calling an activity with retry (via ScheduleTask via CallActivityWithRetryAsync, from durable task extension), the RetryInterceptor retries the activities if they fail. For each retry, a TimerCreated (and consequent TimerFired) events are added to the history of the orchestration.

(And due to the following:

if (isLastRetry)
{
    // Earlier versions of this retry interceptor had a bug that scheduled an extra delay timer.
    // It's unfortunately not possible to remove the extra timer since that would potentially
    // break the history replay for existing orchestrations. Instead, we do the next best thing
    // and schedule a timer that fires immediately instead of waiting for a full delay interval.
    await this.context.CreateTimer(this.context.CurrentUtcDateTime, "Dummy timer for back-compat");
    break;
}

after the last attempt, another TimerEvent and TimerFired event are added).

When AzureTableTrackingStore.RewindHistoryAsync is called to rewind the orchestration, only TaskFailed and SubOrchestrationInstanceFailed (and their corresponding TaskScheduled and SubOrchestrationInstanceCreated) get their EventType reset to GenericEvent. So when the orchestration restarts, it encounters TimerCreated and TimerFired events that it did not expect, and causes the following error:

Non-Deterministic workflow detected: A previous execution of this orchestration scheduled a
timer task with sequence number 1 but the current replay execution hasn't (yet?) scheduled this 
task. Was a change made to the orchestrator code after this instance had already startedrunning?

I think to fix this, the rewind algorithm should take the timer events into account, and also overwrite their EventType to GenericEvent. I've tested this by modifying the table storage entries before rewinding and that works. I can imagine that the fix is to find all the TimerCreated events that have an EventId higher than the TaskScheduled that is being reset. The corresponding TimerFired events can be found using the TimerId property.

I don't mind implementing the fix for this, but I would like to know if this is the best approach. I can imagine that this change can inadvertently reset some timers it should not touch. But as the Rewind algorithm just resets all the TaskFailed events, resetting the timer events after those events might just work fine.

cgillum commented 2 years ago

Adding @jviau to this discussion. Handling activity and sub-orchestration retries correctly is one potential gap we recently identified in the rewind implementation. How to handle timers has been another area where we know we needed to do more thinking. It sounds like you've identified a specific bug and a potential fix, which is really helpful.

Just to make your example more concrete, can you share an example of what the orchestration looks like that reproduces this error? For example, is it simply an orchestration that calls one activity, retries once, and then fails?

ThomasBleijendaal commented 2 years ago

The orchestration is quite simple, and it boils down to:

OrchestrationContext.CallActivityWithRetryAsync<Response>(
  "activity", 
  new RetryOptions(DelayOrDefault(delay), maxAttempts)
  {
      BackoffCoefficient = 2
  },
  new Request());

The "activity" function is just throwing a InvalidOperationException. After the initial run, the orchestration failed because "activity" always throws. The history table has 3 sets of TaskScheduled + TaskFailed (and 3 sets of TimerCreated and TimerFired).

I modify "activity" to not throw and then send a Rewind request. The TaskScheduled and TaskFailed events are reset to GenericEvent. "activity" is triggered correctly. Only a new TaskScheduled event is added to the history. The orchestrator resumes and then finds the TimerCreated it does not expect, and completes with a Failed state, even before the "activity" function completes.

lucaslorentz commented 2 years ago

After facing the same issue described here and some other issues as well, I've implemented a more advanced rewind algorithm in EFCoreOrchestrationServiceClient.cs. It's working great for me so far.

First, we should consider that rewind is not only used on failed orchestrations, you could use rewind to re-open a completed or terminated orchestration. Imagine you fix your workflow by adding extra steps at the end for example and then rewind the execution to reopen it. This is how I find the optimal rewind point:

Once a rewind point is found, all events after that point must be rewound (converted into GenericEvent). With no exception, otherwise, there is a risk of Non-Deterministic errors.

Every TaskScheduled and TimerCreated event kept that had its corresponding completion event rewound must be rescheduled. To be able to do that, you need to store Activity and Timers inputs in your history table as well.

Every SubOrchestrationInstanceCreated event kept that had its corresponding completion event rewound must have the suborchestration rewound as well, using the same logic described above to find the optimal rewind point. Failed SubOrchestrations will be rewound to the last failed activity/suborchestration as well, while non-failed ones will just be reopened and will fire the orchestration completion message again.

This function implements the rewind and identifies all messages that must be scheduled and suborchestrations that must be rewound.

And finally, if the current orchestration has a parent, the parent must be rewound as well, but instead of using the rewind point logic from above, it should be rewound exactly to the "SubOrchestrationInstanceCompleted" related to the current orchestration.

The algorithm described above is strong enough to rewind to any history point, so, you could expose a new API that let users rewind orchestrations to the point they want as well.

boylec commented 1 year ago

Any movement here @cgillum? Rewind is pretty useless if we can't use resilient retry policies alongside it. Conversely, rewind functionality is pretty awesome if we can use it alongside resilient retry policies.

cgillum commented 1 year ago

No updates. This item unfortunately hasn't made it high enough in the team's backlog.

boylec commented 1 year ago

No updates. This item unfortunately hasn't made it high enough in the team's backlog.

Anything I/we can do to help with that? This is a pretty major feature for our team.

cgillum commented 1 year ago

Adding @lilyjma, who's helping manage our backlog.

We accept pull requests. However, one of our goals for improving this feature is to rewrite it so that it's simpler and works for all backend types (Azure Storage, Netherite, MSSQL, etc.). The currently implementation only works for Azure Storage. There is a brief proposal here if you're interested in taking a look and potentially contributing: https://github.com/Azure/durabletask/issues/731.

boylec commented 1 year ago

Definitely interested thanks for the pointers. Will be taking a look.