Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.51k stars 290 forks source link

Multiple levels of SubOrchestrations - an error gets thrown when reading a SubOrchestrationCompleteEvent of a nested one #1011

Open preslavnd opened 10 months ago

preslavnd commented 10 months ago

The situation I end up with is I have an Orchestration, which creates and waits on a SubOrchestration, which does also create and wait on another SubOrchestration.

When the inner-most one completes, I get the following exception:

Backing off for 1 seconds until 5 successful operations
      System.InvalidCastException: Unable to cast object of type 'DurableTask.Core.History.ExecutionStartedEvent' to type 'DurableTask.Core.History.SubOrchestrationInstanceCreatedEvent'.
         at DurableTask.Core.TaskOrchestrationDispatcher.ReconcileMessagesWithState(TaskOrchestrationWorkItem workItem, String dispatcher, ErrorPropagationMode errorPropagationMode, LogHelper logHelper) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 844
         at DurableTask.Core.TaskOrchestrationDispatcher.<OnProcessWorkItemAsync>d__25.MoveNext() in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 345
      --- End of stack trace from previous location where exception was thrown ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
         at DurableTask.Core.TaskOrchestrationDispatcher.<OnProcessWorkItemAsync>d__25.MoveNext() in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 594
      --- End of stack trace from previous location where exception was thrown ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
         at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
         at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
         at DurableTask.Core.TaskOrchestrationDispatcher.<OnProcessWorkItemSessionAsync>d__24.MoveNext() in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 211
      --- End of stack trace from previous location where exception was thrown ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
         at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
         at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
         at DurableTask.Core.WorkItemDispatcher`1.<ProcessWorkItemAsync>d__45.MoveNext() in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 375

What happens next is, looking at the database data (I am using SqlServer provider), the SubOrchestrationCompleted event of the inner-most orchestration is lost (NewEvents does not contain it, nor it gets applied to the instance Histories is not showing it being received). A SubOrchestrationCompleted event for the parent is waiting inside NewEvents but cannot be processed as well and generates this exception everytime the application is trying to pick new events. By looking at the code, it seems this error probably comes from

if (Activity.Current != null)
                {
                    HistoryEvent historyEvent = message.Event;
                    if (historyEvent is TimerFiredEvent timerFiredEvent)
                    {
                        // We immediately publish the activity span for this timer by creating the activity and immediately calling Dispose() on it.
                        TraceHelper.EmitTraceActivityForTimer(workItem.OrchestrationRuntimeState.OrchestrationInstance, workItem.OrchestrationRuntimeState.Name, message.Event.Timestamp, timerFiredEvent);
                    }
                    else if (historyEvent is SubOrchestrationInstanceCompletedEvent subOrchestrationInstanceCompletedEvent)
                    {
                        SubOrchestrationInstanceCreatedEvent subOrchestrationCreatedEvent = (SubOrchestrationInstanceCreatedEvent)workItem.OrchestrationRuntimeState.Events.FirstOrDefault(x => x.EventId == subOrchestrationInstanceCompletedEvent.TaskScheduledId);

                        // We immediately publish the activity span for this sub-orchestration by creating the activity and immediately calling Dispose() on it.
                        TraceHelper.EmitTraceActivityForSubOrchestrationCompleted(workItem.OrchestrationRuntimeState.OrchestrationInstance, subOrchestrationCreatedEvent);
                    }
                    else if (historyEvent is SubOrchestrationInstanceFailedEvent subOrchestrationInstanceFailedEvent)
                    {
                        SubOrchestrationInstanceCreatedEvent subOrchestrationCreatedEvent = (SubOrchestrationInstanceCreatedEvent)workItem.OrchestrationRuntimeState.Events.FirstOrDefault(x => x.EventId == subOrchestrationInstanceFailedEvent.TaskScheduledId);

                        // We immediately publish the activity span for this sub-orchestration by creating the activity and immediately calling Dispose() on it.
                        TraceHelper.EmitTraceActivityForSubOrchestrationFailed(workItem.OrchestrationRuntimeState.OrchestrationInstance, subOrchestrationCreatedEvent, subOrchestrationInstanceFailedEvent, errorPropagationMode);
                    }
                }

where a wrong event type SubOrchestrationInstanceCreatedEvent is assumed to be returned? The end result is that the Orchestration and it's SubOrchestration are stuck and can never progress/finish.