Azure / durabletask

The Durable Task Framework allows users to write long-running persistent workflows in C# using async/await capabilities.
Apache License 2.0

Official support for rewinding failed orchestrations #731

Open cgillum opened 2 years ago

cgillum commented 2 years ago

Problem

There are cases when an orchestrator instance can fail due to a bug, an environment issue, etc. In such cases, the orchestration fails and the process needs to be manually restarted from scratch. This is not ideal because the process often involves several complex steps, human interactions, etc., and it's not always practical to rebuild that state.

Proposal

The durable task framework uses an event sourcing model under the hood to manage state. This means that in theory it should be possible to "rewind" back to a previously known good state and restart execution from there. Exposing this as a feature would be a great way to recover failed orchestrations (after the underlying issue has been fixed).

Task RewindTaskOrchestrationAsync(string instanceId, string reason);
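For illustration only, client-side usage might look something like this (a minimal sketch; the method does not exist yet, and its placement on TaskHubClient, the instance ID, and the reason string are assumptions rather than a finalized API):

```csharp
using System.Threading.Tasks;
using DurableTask.Core;

static class RewindUsageSketch
{
    // Hypothetical usage: RewindTaskOrchestrationAsync follows the proposed
    // signature above; its placement on TaskHubClient is an assumption.
    public static async Task RecoverAsync(TaskHubClient client)
    {
        // After fixing the underlying bug or environment issue, rewind the
        // failed instance so it resumes from its last known good state instead
        // of being restarted from scratch.
        await client.RewindTaskOrchestrationAsync(
            "my-failed-instance-id",              // hypothetical instance ID
            "Rewinding after hotfix deployment"); // reason
    }
}
```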

Design notes

Rewind should work on failed orchestrations (the design could potentially be extended to terminated and completed orchestrations as well). Internally this API enqueues a message that gets picked up by a worker. When the TaskOrchestrationDispatcher receives the message, it will update the in-memory history to remove the last failure and then replay the orchestration with the updated history.

Note that in order for this to work, the storage backends must be willing to process messages for orchestrations in a failed state. This may not be the case today, in which case an alternate design could be to expose a new method on IOrchestrationService for this functionality. However, this has three problems: 1) it still requires existing backends to make changes to support it, 2) new backends will also be required to implement this new capability explicitly, and 3) it won't be feasible to trigger this from a client operation, since TaskHubClient doesn't have direct access to IOrchestrationService methods.
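As a rough illustration of the history-rewriting step (a sketch only, not a proposed implementation; the event type and property names are based on my reading of DurableTask.Core.History and should be verified, and exactly which events need to be removed is the open design question):

```csharp
using System.Collections.Generic;
using System.Linq;
using DurableTask.Core;
using DurableTask.Core.History;

static class RewindSketch
{
    // Sketch: remove the terminal failure from an in-memory history so the
    // dispatcher can replay the orchestration from its last good state.
    // Whether related TaskFailedEvent entries, retry timers, etc. must also
    // be removed is part of the design work called out in this issue.
    public static IList<HistoryEvent> RemoveTerminalFailure(IEnumerable<HistoryEvent> history)
    {
        return history
            .Where(e => !(e is ExecutionCompletedEvent completed
                          && completed.OrchestrationStatus == OrchestrationStatus.Failed))
            .ToList();
    }
}
```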

Prior work

This was done originally for Durable Functions but was never directly exposed to DTFx. Also, the core logic was only implemented for the Azure Storage backend (DurableTask.AzureStorage).

To improve on the prior work, we should do two things:

  1. Expose rewind functionality directly in the DurableTask.Core APIs.
  2. Implement the core functionality in DurableTask.Core in such a way that it can work with any backend store without modification.

Implementation notes

As part of making this a generic feature, we should replace the DurableTask.AzureStorage implementation with a generic implementation.

jianjunwang2 commented 1 year ago

Chris, good to see we want to officially support it in the DT Core. Do you have any ETA for this work?

jundayin commented 1 year ago

Hi, any update on the progress? Is this something straightforward enough that the community can contribute?

cgillum commented 1 year ago

No ETA currently. There's general agreement that we should do this, but we haven't been able to prioritize it highly enough in our team backlog yet. There are also a few design questions that would need to be resolved, such as how to deal with failure retries and timers.

We're always open to community contributions. However, I don't think it will be especially straightforward. At this point, any potential contribution would need to first include a design proposal that can be agreed upon by the project maintainers.

boylec commented 1 year ago

Initial Knee-Jerk Idea

Just wanted to float the following super simple/rough idea as a sanity check. It illustrates a high-level view of what might occur when the rewind/rerun/revive is invoked, not implementation details.

'Rewind' doesn't say whether the supplied event ID (if the consumer chooses to supply one) is inclusive or exclusive, so I thought that a name like "RerunFromEvent" or some such might better imply that the supplied event ID is going to be removed.

// on IOrchestrationService
Task RerunFromEventAsync(
    string orchestrationInstanceId,
    string? startingWithEventId = null,
    string? archivePartitionSuffix = null);

graph TD;

invoke(User Invokes RerunFromEventAsync)
isSuffixSupplied{archivePartitionSuffix supplied?}
copyHistory[Copy history appending\nsupplied suffix to\norchestration instance id\nfor the copy's instance id]
isEventIdSupplied{Is a specific event id supplied?}
eventXIsSuppliedEvent[Event X is supplied event id]
eventXIsLastFailedEvent[Event X is last failed event id]
deleteAllHistoryFromEvent[Delete all history\nfrom event X\n and onwards for\nthe orchestration instance ID\nprovided]
runOrchestration[Invoke orchestration so\nthat it picks up\nfrom the last event]

invoke --> isSuffixSupplied;
isSuffixSupplied -- yes --> copyHistory --> isEventIdSupplied;
isSuffixSupplied -- no --> isEventIdSupplied;
isEventIdSupplied -- yes --> eventXIsSuppliedEvent --> deleteAllHistoryFromEvent;
isEventIdSupplied -- no --> eventXIsLastFailedEvent --> deleteAllHistoryFromEvent;
deleteAllHistoryFromEvent --> runOrchestration;

Any feedback?

I think this tackles dealing w/ retries and timers. Am I vastly underestimating the complexity here? Is deleting history sacrilegious?

When I think of event sourcing I think of a rewind as a very "hard" operation not to be taken lightly by the rewinder. AKA:

You are only doing this because things got seriously borked. So let us go ahead and cut off the history (optionally archiving it beforehand) before triggering a replay.

nilsmehlhorn commented 1 year ago

We were in need of a solution and therefore came up with our own. I know this won't be directly transferable to an official implementation, but maybe it helps in finding and evaluating ideas.

Error handling based on a custom idempotency layer

We've implemented a custom idempotency mechanism on top of the Durable Functions .NET API for handling activities which can't easily be made idempotent. It's not meant (nor able) to implement full idempotency, but rather offers a generic way of preventing duplication of non-critical side effects like e-mail sending or acceptance of external events.

The following APIs are available as extension methods on [`IDurableOrchestrationContext`](https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.webjobs.extensions.durabletask.idurableorchestrationcontext?view=azure-dotnet):

- `CallActivityIdempotentAsync`: wraps an activity call so that it's only invoked when the idempotency layer has not yet recorded its execution
- `WaitForExternalEventIdempotent`: waits for an external event only if it hasn't yet been received and thus recorded by the idempotency layer

The idempotency layer is backed by an Azure Storage Table where each executed operation is recorded along with its (optional) input and output, indexed by an idempotency key.

| IdempotencyKey | Operation | Input | Output |
|----------------|----------------------------|--------------------|-----------------------|
| I1 | Activity[SendWelcomeEmail] | {userId: "usr_12"} | null |
| I2 | Event[InvoiceUploaded] | null | {invoiceId: "inv_23"} |

The APIs allow setting a custom idempotency key; by default, however, the instance ID of the executing orchestrator is used. This way we basically associate a workflow instance with a specific side effect. Now, when one of the idempotency layer methods is invoked, it checks the table to see whether the underlying operation has already been executed. If no matching entry is found, the operation is executed and the table updated. If there's a matching entry, the idempotency layer only returns the recorded output from the original execution. Note that the idempotency layer will verify that the inputs from any subsequent calls with the same idempotency key match the ones from the original execution, in order to guarantee that the underlying side effects would be identical. If there's a mismatch, an exception is thrown.

When an unhandled exception occurs, the corresponding orchestrator enters the runtime status "Failed". In complex orchestrations, especially when sub-orchestrators or timers are involved, this unfortunately isn't easily recoverable (see [this issue](https://github.com/Azure/azure-functions-durable-extension/issues/617#issuecomment-1243613332)). This has been a huge motivation for implementing our custom idempotency layer, because it now enables us to terminate and purge a failed orchestration in order to restart it from scratch so that the failing operation can be reliably retried (ideally after fixing the cause of failure).

Restarting a workflow works as follows:

1. Terminate and purge all orchestrators related to the workflow. Prefixing sub-orchestrators with a common scheme helps (see [this issue](https://github.com/Azure/azure-functions-durable-extension/issues/506)). We're commonly provisioning a convenience endpoint per workflow for this within our function app.
2. Start the top-most orchestrator workflow with the same instance ID.
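For illustration, a minimal sketch of what an activity wrapper like `CallActivityIdempotentAsync` could look like (this is not the author's actual implementation; the `LookupIdempotencyRecord`/`SaveIdempotencyRecord` activity names and the record shape are hypothetical, and input verification is omitted for brevity):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public class IdempotencyRecord<TResult>
{
    // Hypothetical shape of an entry in the idempotency table.
    public TResult Output { get; set; }
}

public static class IdempotencyExtensions
{
    // Sketch: only invoke the activity if no prior execution is recorded under
    // the idempotency key; otherwise return the previously recorded output.
    public static async Task<TResult> CallActivityIdempotentAsync<TResult>(
        this IDurableOrchestrationContext context,
        string activityName,
        object input,
        string idempotencyKey = null)
    {
        // By default the orchestration instance ID acts as the idempotency key.
        idempotencyKey ??= context.InstanceId;

        // Table reads/writes happen inside activities so the orchestrator code
        // stays deterministic. These activity names are hypothetical.
        var existing = await context.CallActivityAsync<IdempotencyRecord<TResult>>(
            "LookupIdempotencyRecord", new { idempotencyKey, activityName, input });
        if (existing != null)
        {
            return existing.Output; // already executed: replay the recorded result
        }

        TResult output = await context.CallActivityAsync<TResult>(activityName, input);

        await context.CallActivityAsync(
            "SaveIdempotencyRecord", new { idempotencyKey, activityName, input, output });

        return output;
    }
}
```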

What we're basically doing is keeping our own orchestrator history (only where necessary), as the official one is proprietary (i.e. undocumented, doesn't provide an API, can change). Then, when something goes wrong, we discard the official history and restart the orchestrator (indirectly including the sub-orchestrators it creates) from scratch.

All activities that are inherently idempotent won't run unintended side effects by themselves. Other activities and incoming real external events (that aren't sent by the orchestrators themselves) are handled by our custom history (a.k.a. the idempotency layer). Since our idempotency layer only records successful interactions (not perfect idempotency, I know), this mechanism results in the orchestrator resuming at the state where it left off before an error occurred.
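And a minimal sketch of the terminate/purge/restart step using the Durable Functions client API (the orchestrator name and the surrounding helper are assumptions, not the author's actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class WorkflowRestart
{
    // Sketch: terminate, purge, and restart a workflow under the same instance
    // ID so the idempotency layer can skip side effects that already ran.
    public static async Task RestartAsync(
        IDurableOrchestrationClient client, string instanceId, object input)
    {
        // 1. Terminate and purge the failed orchestrator (prefixed
        //    sub-orchestrator instance IDs would be handled the same way).
        await client.TerminateAsync(instanceId, "Restarting workflow after failure");
        await client.PurgeInstanceHistoryAsync(instanceId);

        // 2. Start the top-most orchestrator again with the same instance ID;
        //    "TopLevelOrchestrator" is a placeholder name.
        await client.StartNewAsync("TopLevelOrchestrator", instanceId, input);
    }
}
```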

@boylec Intuitively, I think your approach could be fine, maybe a bit low-level, because when writing orchestrator code you're not dealing with the underlying events in the history. Maybe there'd be merit in providing a proper API on top of the history first.

boylec commented 1 year ago

> @boylec Intuitively, I think your approach could be fine, maybe a bit low-level, because when writing orchestrator code you're not dealing with the underlying events in the history. Maybe there'd be merit in providing a proper API on top of the history first.

Thanks for the feedback. Unfortunately, I've been slammed enough that I don't think I'm going to be able to contribute to this any time soon.

Hope this feature gets implemented; it'd be a pretty major boon for our team to be able to both use retries with timer triggers and also use the rewind functionality.

I'll update again later if/when I ever get the time to try and contribute on this.

Michson07 commented 1 week ago

Hi, any updates? Looks like the topic has been dead for a year, although the feature was pretty great in the in-process mode. Do you know if there is a workaround to restart the last step for a failed instance?