Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.51k stars 290 forks source link

[WIP] Add extra instrumentation on non-determinism error #912

Open davidmrdavid opened 1 year ago

davidmrdavid commented 1 year ago

Currently just a hotfix instrumentation for: https://github.com/microsoft/durabletask-dotnet/issues/153

Though eventually we may want to enhance our non-determinism error messages to give us more context on the status of the actionsMap during these exceptions.

davidmrdavid commented 1 year ago

This is the expanded error message we can expect to get with these changes:

""" "Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'SayHello' (version ''), but the current replay execution hasn't (yet?) scheduled this task. At the time of this error: the number of pending actions is '0' for the following taskIDs: {}. The number of open tasks is '0 for the following taskIDs: {}. The set of history pastEvents were: {DurableTask.Core.History.OrchestratorStartedEvent,DurableTask.Core.History.ExecutionStartedEvent,DurableTask.Core.History.TaskScheduledEvent,DurableTask.Core.History.OrchestratorCompletedEvent,DurableTask.Core.History.OrchestratorStartedEvent,DurableTask.Core.History.TaskCompletedEvent,DurableTask.Core.History.TaskScheduledEvent,DurableTask.Core.History.OrchestratorCompletedEvent,DurableTask.Core.History.OrchestratorStartedEvent,DurableTask.Core.History.TaskCompletedEvent,DurableTask.Core.History.TaskScheduledEvent,DurableTask.Core.History.OrchestratorCompletedEvent} The set of history newEvents were: {DurableTask.Core.History.OrchestratorStartedEvent,DurableTask.Core.History.ExecutionStartedEvent,DurableTask.Core.History.TaskScheduledEvent,DurableTask.Core.History.OrchestratorCompletedEvent,DurableTask.Core.History.OrchestratorStartedEvent,DurableTask.Core.History.TaskCompletedEvent,DurableTask.Core.History.TaskScheduledEvent,DurableTask.Core.History.OrchestratorCompletedEvent,DurableTask.Core.History.OrchestratorStartedEvent,DurableTask.Core.History.TaskCompletedEvent,DurableTask.Core.History.TaskScheduledEvent,DurableTask.Core.History.OrchestratorCompletedEvent} The list of all actions so far is: {} Was a change made to the orchestrator code after this instance had already started running?", """

jviau commented 1 year ago

Just to be sure - we are not actually going to check this into main correct? If you need to check this into the repo to get a build, please use a feature branch - such as feature/non-determinism-logger.

As for the message, I don't want to hold this up and you can go with what you want. However, instead of building it all up on every action, what I would do is update NonDeterminismException to take in OrchestrationRuntimeState, IEnumerable<OrchestratorAction>, and OrchestrationAction (the last one being the failed action taken), then have its ToString method pretty-print a diff of the two. It is also important to include the contents of the history events / actions and not just their type names (even then, full type names are not needed - can use a shorthand name). I envision an output of something like:

Non-determinism encountered when running orchestration. Actions differ at:

<EventId> <EventShortHandName> <EventParams (ie, Name, Version, etc)> 
.
.
.
<EventId> <EventShortHandName> <EventParams (ie, Name, Version, etc)> 
- EXPECTED: <EventId> <EventShortHandName> <EventParams (ie, Name, Version, etc)>
+ ACTUAL: <EventId> <EventShortHandName> <EventParams ((ie, Name, Version, etc)>
<EventId> <EventShortHandName> <EventParams (ie, Name, Version, etc)> 
.
.
.
<EventId> <EventShortHandName> <EventParams (ie, Name, Version, etc)> 

The above uses diff markdown, which means including - and + at the start of the line to show changes. The event with - is the one we expected from history, the one with + is the new unexpected action (or we can flip that, whatever you want). All events before/after those two lines are the existing history events.

davidmrdavid commented 1 year ago

Just to be sure - we are not actually going to check this into main correct? If you need to check this into the repo to get a build, please use a feature branch - such as feature/non-determinism-logger.

Indeed, definitely not trying to merge this version of the code to main. Mostly just creating this PR for discussion and visibility. If I understand your suggestion correctly, you're asking me to re-target the PR to a feature branch. I can do that in a few.

As for the message, I don't want to hold this up and you can go with what you want. However, instead of building it all up on every action, what I would do is update NonDeterminismException to take in OrchestrationRuntimeState, IEnumerable, and OrchestrationAction (the last one being the failed action taken), [... truncated]

I like the idea of a diff, though that will require some thinking to implement efficiently. I want to follow up on this idea but, for the sake of getting at least some partial information from the customer, I think I'll take you on this being a non-blocking suggestion and release this in a private package. We can then continue iterating on this design :)