cgillum opened this issue 2 years ago (Open)
Chris, good to see we want to officially support it in the DT Core. Do you have any ETA for this work?
Hi, any update on the progress? Is this something straightforward enough that the community can contribute?
No ETA currently. There's general agreement that we should do this, but we haven't been able to prioritize it highly enough in our team backlog yet. There are also a few design questions that would need to be resolved, such as how to deal with failure retries and timers.
We're always open to community contributions. However, I don't think it will be especially straightforward. At this point, any potential contribution would need to first include a design proposal that can be agreed upon by the project maintainers.
Just wanted to float the following super simple/rough idea as a sanity check. It illustrates a high-level view of what might occur when the rewind/rerun/revive is invoked, not implementation details.
"Rewind" doesn't imply whether the supplied event id (if the consumer chooses to supply one) is inclusive or exclusive, so maybe a name like "RerunFromEvent" would better imply that the supplied event id is going to be removed.
// on IOrchestrationService
// Reruns the orchestration from the supplied event id (or from the last failed
// event when none is supplied). If archivePartitionSuffix is supplied, the existing
// history is first copied to "<orchestrationInstanceId><archivePartitionSuffix>".
Task RerunFromEventAsync(
    string orchestrationInstanceId,
    string? startingWithEventId = null,
    string? archivePartitionSuffix = null);
graph TD;
invoke(User Invokes RerunFromEventAsync)
isSuffixSupplied{archivePartitionSuffix supplied?}
copyHistory[Copy history appending\nsupplied suffix to\norchestration instance id\nfor the copy's instance id]
isEventIdSupplied{Is a specific event id supplied?}
eventXIsSuppliedEvent[Event X is supplied event id]
eventXIsLastFailedEvent[Event X is last failed event id]
deleteAllHistoryFromEvent[Delete all history\nfrom event X\n and onwards for\nthe orchestration instance ID\nprovided]
runOrchestration[Invoke orchestration so\nthat it picks up\nfrom the last event]
invoke --> isSuffixSupplied;
isSuffixSupplied -- yes --> copyHistory --> isEventIdSupplied;
isSuffixSupplied -- no --> isEventIdSupplied;
isEventIdSupplied -- yes --> eventXIsSuppliedEvent --> deleteAllHistoryFromEvent;
isEventIdSupplied -- no --> eventXIsLastFailedEvent --> deleteAllHistoryFromEvent;
deleteAllHistoryFromEvent --> runOrchestration;
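For concreteness, here is a minimal C# sketch of that flow, assuming a hypothetical history-store abstraction (IHistoryStore and all of its members below are illustrative assumptions, not existing DTFx APIs):

using System.Threading.Tasks;

// Hypothetical abstraction over the backend's history store; these members do not
// exist in DTFx today and only illustrate the flow in the diagram above.
public interface IHistoryStore
{
    Task CopyHistoryAsync(string sourceInstanceId, string targetInstanceId);
    Task<string> GetLastFailedEventIdAsync(string instanceId);
    Task DeleteEventsFromAsync(string instanceId, string startingEventId); // inclusive
    Task EnqueueReplayAsync(string instanceId); // worker replays the remaining history
}

public class RerunFromEventSketch
{
    private readonly IHistoryStore store;

    public RerunFromEventSketch(IHistoryStore store) => this.store = store;

    public async Task RerunFromEventAsync(
        string orchestrationInstanceId,
        string? startingWithEventId = null,
        string? archivePartitionSuffix = null)
    {
        // Optionally archive the current history under "<instanceId><suffix>".
        if (archivePartitionSuffix != null)
        {
            await store.CopyHistoryAsync(
                orchestrationInstanceId,
                orchestrationInstanceId + archivePartitionSuffix);
        }

        // Event X: either the caller-supplied event id or the last failed event.
        string eventX = startingWithEventId
            ?? await store.GetLastFailedEventIdAsync(orchestrationInstanceId);

        // Delete event X and everything after it, then replay what remains.
        await store.DeleteEventsFromAsync(orchestrationInstanceId, eventX);
        await store.EnqueueReplayAsync(orchestrationInstanceId);
    }
}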
Any feedback?
I think this tackles dealing w/ retries and timers. Am I vastly underestimating the complexity here? Is deleting history sacrilegious?
When I think of event sourcing, I think of a rewind as a very "hard" operation not to be taken lightly by the rewinder. That is:
You are only doing this because things got seriously borked. So let us go ahead and cut off the history (optionally archiving it beforehand) before triggering a replay.
We were in need of a solution and therefore came up with our own. I know this won't be directly transferable to an official implementation, but maybe it helps in finding and evaluating ideas.
What we're basically doing is keeping our own orchestrator history (only where necessary), since the official one is proprietary (i.e. undocumented, doesn't provide an API, can change). Then, when something goes wrong, we discard the official history and restart the orchestrator (indirectly including the sub-orchestrators it creates) from scratch.
All activities that are inherently idempotent won't run unintended side effects by themselves. Other activities and incoming real external events (ones that aren't sent by the orchestrators themselves) are handled by our custom history (a.k.a. idempotency layer). Since our idempotency layer only records successful interactions (not perfect idempotency, I know), this mechanism results in the orchestrator resuming to the state where it left off before an error occurred.
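To make that idea concrete, here is a minimal sketch of such an idempotency layer, assuming a hypothetical durable key-value store for recording successful activity results (IIdempotencyStore, RunOnceAsync, and the key scheme are illustrative assumptions, not part of DTFx or of the commenter's actual code):

using System;
using System.Threading.Tasks;

// Hypothetical durable store; in practice this could be a table or blob container.
public interface IIdempotencyStore
{
    Task<(bool found, string result)> TryGetAsync(string key);
    Task PutAsync(string key, string result);
}

public class IdempotencyLayer
{
    private readonly IIdempotencyStore store;

    public IdempotencyLayer(IIdempotencyStore store) => this.store = store;

    // Wraps a non-idempotent activity call. Only successful results are recorded,
    // so after the official history is discarded and the orchestrator restarts
    // from scratch, previously completed work is skipped and its result replayed.
    public async Task<string> RunOnceAsync(
        string instanceId, string activityName, string input,
        Func<string, Task<string>> activity)
    {
        string key = $"{instanceId}:{activityName}:{input}";

        var (found, cached) = await store.TryGetAsync(key);
        if (found)
        {
            return cached; // already ran successfully before the failure
        }

        string result = await activity(input); // failures are not recorded
        await store.PutAsync(key, result);
        return result;
    }
}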
@boylec Intuitively, I think your approach could be fine, though maybe a bit low-level, because when writing orchestrator code you're not dealing with the underlying events in the history. Maybe there'd be merit in providing a proper API on top of the history first.
Thanks for the feedback. Unfortunately I've been slammed enough that I don't think I'll be able to contribute to this any time soon.
Hope this feature gets implemented; it'd be a pretty major boon for our team to be able to both use retries with timer triggers and also use the rewind functionality.
I'll update again later if/when I get the time to try to contribute to this.
Hi, any updates? Looks like the topic has been dead for a year, although the feature was pretty great in in-process mode. Do you know if there is a workaround to restart the last step for a failed instance?
@boylec Is this still on the radar?
Problem
There are cases when an orchestrator instance can fail due to a bug, an environment issue, etc. In such cases, the orchestration fails and the process needs to be manually restarted from scratch. This is not ideal because oftentimes the process involves several complex steps, human interactions, etc., and it's not always practical to rebuild that state.
Proposal
The Durable Task Framework uses an event-sourcing model under the hood to manage state. This means that in theory it should be possible to "rewind" back to a previously known good state and restart execution from there. Exposing this as a feature would be a great way to recover failed orchestrations (after the underlying issue has been fixed).
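Purely as an illustration of that idea (the event names below are simplified stand-ins, not actual DTFx history payloads), a failed instance's history might look like this, and a rewind would trim the trailing failure so replay resumes from the last good state:

// Simplified, purely illustrative history for a failed instance.
string[] sampleHistory =
{
    "ExecutionStarted",
    "TaskScheduled: StepA",
    "TaskCompleted: StepA",        // last known good state
    "TaskScheduled: StepB",
    "TaskFailed: StepB",           // a rewind would remove this event...
    "ExecutionCompleted (Failed)", // ...and this one, then replay to retry StepB
};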
Design notes
Rewind should work on failed orchestrations (the design could potentially be extended to terminated and completed orchestrations as well). Internally, this API enqueues a message that gets picked up by a worker. When the TaskOrchestrationDispatcher receives the message, it will update the in-memory history to remove the last failure and then replay the orchestration with the updated history.
Note that in order for this to work, the storage backends must be willing to process messages for orchestrations in a failed state. This may not be the case today, in which case an alternate design could be to expose a new method on IOrchestrationService for this functionality. However, this has three problems: 1) it still requires existing backends to make changes to support it, 2) new backends will also be required to implement this new capability explicitly, and 3) it won't be feasible to trigger this from a client operation, since TaskHubClient doesn't have direct access to IOrchestrationService methods.
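One way to picture that alternate design is as an opt-in capability interface that backends implement alongside IOrchestrationService. This is only a sketch; the interface name and signature below are hypothetical and not part of DTFx:

using System.Threading.Tasks;

// Hypothetical opt-in capability; not an existing DTFx interface. A backend that
// implements it agrees to rewrite stored history so a failed instance can resume.
public interface IOrchestrationServiceRewind
{
    // Removes the trailing failure events for the given instance and re-enqueues
    // it for execution. "reason" is recorded for diagnostics.
    Task RewindTaskOrchestrationAsync(string instanceId, string reason);
}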
Prior work
This was done originally for Durable Functions but was never directly exposed to DTFx. Also, the core logic was only implemented for the Azure Storage backend (DurableTask.AzureStorage).
To improve on the prior work, we should do two things: expose the capability directly in DTFx, and implement the core logic generically rather than only for the Azure Storage backend.
Implementation notes
As part of making this a generic feature, we should replace the DurableTask.AzureStorage implementation with a generic implementation.
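As a rough sketch of what that generic core logic might look like, assuming the HistoryEvent and EventType types from DurableTask.Core (the HistoryRewinder helper itself is hypothetical, and a real implementation would also need to handle retries, timers, and sub-orchestration state):

using System.Collections.Generic;
using DurableTask.Core.History;

// Hypothetical helper that could live in DurableTask.Core so every backend shares
// the same rewind logic instead of each one re-implementing it.
public static class HistoryRewinder
{
    // Returns a copy of the history with the trailing failure removed so the
    // dispatcher can replay the orchestration and re-schedule the failed work.
    public static IList<HistoryEvent> RemoveLastFailure(IReadOnlyList<HistoryEvent> history)
    {
        var trimmed = new List<HistoryEvent>(history);

        // Drop the terminal "execution completed (failed)" marker, if present.
        while (trimmed.Count > 0 &&
               trimmed[trimmed.Count - 1].EventType == EventType.ExecutionCompleted)
        {
            trimmed.RemoveAt(trimmed.Count - 1);
        }

        // Drop the most recent task or sub-orchestration failure event.
        for (int i = trimmed.Count - 1; i >= 0; i--)
        {
            if (trimmed[i].EventType == EventType.TaskFailed ||
                trimmed[i].EventType == EventType.SubOrchestrationInstanceFailed)
            {
                trimmed.RemoveAt(i);
                break;
            }
        }

        return trimmed;
    }
}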