Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Exception when Continue-as-New and Termination happens at the same time #2264

Open khairihafsham opened 2 years ago

khairihafsham commented 2 years ago

Description

When Continue-as-New and Termination happen at about the same time, an exception is thrown: System.InvalidOperationException: Multiple ExecutionCompletedEvent found, potential corruption in state storage

It depends on timing, but locally I've managed to reproduce this issue quite often with the code shared below.

Expected behavior

Workflow is terminated

Actual behavior

An exception is thrown, and the orchestrator retries with exponential backoff to process both messages. The exception below is from my test using Durable Functions extension version 2.8.0.

TaskOrchestrationDispatcher-bbbf312ef1524cf581b3df919d1ffef1-0: Unhandled exception with work item '03ebaed11b1442a8af3f79e35fb5577c': System.InvalidOperationException: Multiple ExecutionCompletedEvent found, potential corruption in state storage
at DurableTask.Core.OrchestrationRuntimeState.SetMarkerEvents(HistoryEvent historyEvent) in /_/src/DurableTask.Core/OrchestrationRuntimeState.cs:line 254
at DurableTask.Core.TaskOrchestrationDispatcher.ProcessWorkflowCompletedTaskDecision(OrchestrationCompleteOrchestratorAction completeOrchestratorAction, OrchestrationRuntimeState runtimeState, Boolean includeDetails, Boolean& continuedAsNew) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 816
at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 399
at DurableTask.Core.TaskOrchestrationDispatcher.OnProcessWorkItemSessionAsync(TaskOrchestrationWorkItem workItem) in /_/src/DurableTask.Core/TaskOrchestrationDispatcher.cs:line 217
at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 459

Relevant source code snippets

Note about the code: the timer value is set very low to increase the chances of reproducing the issue.

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

namespace Demo
{
    public class DemoInput
    {
        public int Current { get; set; }
        public int Max { get; set; }
    }

    public class DemoOrchestrator
    {

        [FunctionName("DemoStart")]
        public static async Task<HttpResponseMessage> HttpStart(
            [HttpTrigger(AuthorizationLevel.Anonymous, "get", "post")] HttpRequestMessage req,
            [DurableClient] IDurableOrchestrationClient client)
        {
            DemoInput input = await req.Content.ReadAsAsync<DemoInput>();

            string instanceId = await client.StartNewAsync("DemoOrchestrator", input);

            return client.CreateCheckStatusResponse(req, instanceId);
        }

        [FunctionName(nameof(DemoOrchestrator))]
        public async Task ExecuteAsync([OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            try
            {
                DemoInput input = context.GetInput<DemoInput>();
                Console.WriteLine($"Execution {context.InstanceId} {input.Current} out of {input.Max}");

                // Use the replay-safe orchestration clock instead of DateTime.UtcNow,
                // which is non-deterministic inside an orchestrator function.
                await context.CreateTimer(context.CurrentUtcDateTime.AddSeconds(1), CancellationToken.None);

                if (input.Current >= input.Max)
                {
                    Console.WriteLine($"Execution completed {context.InstanceId}");
                    return;
                }

                input.Current++;

                context.ContinueAsNew(input);
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
                throw;
            }
        }
    }
}

Known workarounds

Only tested locally, but triggering termination again via func durable terminate --id uuid manages to terminate the execution.

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

This issue is reproducible locally, but here is some information that I can share.

khairihafsham commented 2 years ago

Hi @amdeel, is there any update regarding this issue? This problem is blocking a big release of our service.

At the moment, we are exploring External Events as an alternative way to "terminate" an orchestrator's execution. It is not ideal, but it does achieve the goal. We would appreciate it if you could advise on any other solution.
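
For anyone following along, the External Events approach could look roughly like the sketch below. It is only an illustration under assumptions, not a confirmed fix for this race: the event name "Stop", the LoopInput type, and the RequestStopAsync helper are hypothetical names made up for this example. The orchestrator waits on both its timer and a stop event, and returns normally when the stop event wins instead of relying on termination.

```csharp
using System.Threading;
using System.Threading.Tasks;

using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

namespace Demo.Sketch
{
    // Hypothetical input type, mirroring DemoInput from the repro above.
    public class LoopInput
    {
        public int Current { get; set; }
        public int Max { get; set; }
    }

    public static class StoppableOrchestrator
    {
        // Hypothetical event name; sender and orchestrator only need to agree on it.
        private const string StopEvent = "Stop";

        [FunctionName(nameof(StoppableOrchestrator))]
        public static async Task ExecuteAsync(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            LoopInput input = context.GetInput<LoopInput>();

            using (var cts = new CancellationTokenSource())
            {
                Task timer = context.CreateTimer(
                    context.CurrentUtcDateTime.AddSeconds(1), cts.Token);
                Task stop = context.WaitForExternalEvent<object>(StopEvent);

                Task winner = await Task.WhenAny(timer, stop);
                if (winner == stop)
                {
                    // Cancel the pending durable timer so the instance can complete.
                    cts.Cancel();
                    return;
                }
            }

            if (input.Current >= input.Max)
            {
                return;
            }

            input.Current++;

            // Keep events that arrived but were not yet consumed, so a stop
            // request raised around the restart is carried into the next generation.
            context.ContinueAsNew(input, preserveUnprocessedEvents: true);
        }

        // Hypothetical helper: call this where TerminateAsync would otherwise be used.
        public static Task RequestStopAsync(
            IDurableOrchestrationClient client, string instanceId)
        {
            return client.RaiseEventAsync(instanceId, StopEvent, null);
        }
    }
}
```

The trade-off is that the orchestrator only stops at points where it is actually waiting on the event, so the instance ends up in the Completed state rather than Terminated.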

schutztj commented 1 year ago

@davidmrdavid - Do you have any timeline/updates regarding when this change will make it to mainline and into a release? This bug is hammering us in production.

I see it currently sitting here: https://github.com/Azure/durabletask/tree/dajusto/patch-continue-as-new-and-terminate-race

schutztj commented 1 year ago

@davidmrdavid - Still hitting this hundreds of times per day. Could you please give an update, as I believe getting this into main is currently on you?