Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Durable Entities stay locked when critical section is left via exception #2743

Open ranasch opened 9 months ago

ranasch commented 9 months ago

Description

Running a .NET 8 isolated durable function that has a critical section guarded by a durable entity. When the orchestration fails inside the critical section, the entity stays locked and no further orchestrations can proceed.

Expected behavior

When the critical section (the using block) is left via an exception, the built-in dispose should release the orchestration's lock on the entity; otherwise it stays locked forever and no other orchestration can obtain a lock.

Actual behavior

When the critical section is left via an exception, the lock on the entity is not released, because the release is never persisted to the Azure Table.

Relevant source code snippets

public static class SayHelloOrchestration
{
    [Function(nameof(SayHelloOrchestration))]
    public static async Task<List<string>> RunOrchestrator([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var outputs = new List<string>();
        ILogger logger = context.CreateReplaySafeLogger(nameof(SayHelloOrchestration));

        logger.LogInformation($"Saying hello ({context.InstanceId})");

        await using (var handle = await context.AcquireResourceLockAsync(Guid.Empty.ToString())) // custom extension, see ResourceLockExtensions below
        {
            outputs.Add(await context.CallActivityAsync<string>(nameof(SayHelloActivity), $"Brisbane ({context.InstanceId})"));
            throw new Exception("Test"); // simulate a failure inside the critical section
            outputs.Add(await context.CallActivityAsync<string>(nameof(SayHelloActivity), $"Sydney ({context.InstanceId})"));
        }
        logger.LogInformation($"Saying goodbye ({context.InstanceId})");
        return outputs;
    }
}

The created entity will stay locked by the critical section, unless you also implement a custom Dispose that calls the entity one final time, which ensures the unlock state is persisted to table storage:

    public static class ResourceLockExtensions
    {
        public static async Task<IAsyncDisposable> AcquireResourceLockAsync(this TaskOrchestrationContext context, string instanceKey)
        {
            var entityId = new EntityInstanceId(nameof(ResourceLockEntity), instanceKey);
            var entityHandle = await context.Entities.LockEntitiesAsync(entityId);

            return new ResourceLockHandle(context, entityId, entityHandle);
        }
    }

    public sealed class ResourceLockHandle : IAsyncDisposable
    {
        private readonly TaskOrchestrationContext _context;
        private readonly EntityInstanceId _entityId;
        private readonly IAsyncDisposable _entityHandle;

        public ResourceLockHandle(TaskOrchestrationContext context, EntityInstanceId entityId, IAsyncDisposable entityHandle)
        {
            _context = context;
            _entityId = entityId;
            _entityHandle = entityHandle;
        }

        public async ValueTask DisposeAsync()
        {
            if (!_context.IsReplaying && _context.Entities.InCriticalSection(out var entityIds) && entityIds.Contains(_entityId))
            {
                await _entityHandle
                    .DisposeAsync()
                    .ConfigureAwait(false);

                // this call does the trick!
                await _context.Entities
                    .CallEntityAsync(_entityId, "dummycallforcleanup ");
            }

            GC.SuppressFinalize(this);
        }
    }

    public static class ResourceLockEntity
    {
        [Function(nameof(ResourceLockEntity))]
        public static Task DispatchAsync([EntityTrigger] TaskEntityDispatcher dispatcher) => dispatcher.DispatchAsync(operation => default);
    }

Unfortunately this approach has a catch: because the extra entity call is only made when the orchestrator is not replaying, replay detects a non-deterministic workflow and fails. So it is not the solution.

Known workarounds

Tried the custom dispose; however, that results in this exception:

[2024-02-13T17:16:55.491Z] Executing 'Functions.SayHelloOrchestration' (Reason='(null)', Id=46f8e4d8-7614-49f2-b803-f20822509981)
[2024-02-13T17:16:55.677Z] An error occurred while executing the orchestrator function 'SayHelloOrchestration'.
[2024-02-13T17:16:55.678Z] Result: An error occurred while executing the orchestrator function 'SayHelloOrchestration'.
Exception: System.Exception: Test
[2024-02-13T17:16:55.679Z]    at FunctionApp2.SayHelloOrchestration.RunOrchestrator(TaskOrchestrationContext context) in C:\Temp\FunctionApp2\SayHelloOrchestration.cs:line 20
[2024-02-13T17:16:55.680Z]    at FunctionApp2.SayHelloOrchestration.RunOrchestrator(TaskOrchestrationContext context) in C:\Temp\FunctionApp2\SayHelloOrchestration.cs:line 22
[2024-02-13T17:16:55.681Z]    at FunctionApp2.DirectFunctionExecutor.ExecuteAsync(FunctionContext context) in C:\Temp\FunctionApp2\Microsoft.Azure.Functions.Worker.Sdk.Generators\Microsoft.Azure.Functions.Worker.Sdk.Generators.FunctionExecutorGenerator\GeneratedFunctionExecutor.g.cs:line 44
[2024-02-13T17:16:55.682Z]    at Microsoft.Azure.Functions.Worker.OutputBindings.OutputBindingsMiddleware.Invoke(FunctionContext context, FunctionExecutionDelegate next) in D:\a\_work\1\s\src\DotNetWorker.Core\OutputBindings\OutputBindingsMiddleware.cs:line 13
[2024-02-13T17:16:55.683Z]    at Microsoft.Azure.Functions.Worker.Extensions.DurableTask.FunctionsOrchestrator.EnsureSynchronousExecution(FunctionContext functionContext, FunctionExecutionDelegate next, FunctionsOrchestrationContext orchestrationContext) in /_/src/Worker.Extensions.DurableTask/FunctionsOrchestrator.cs:line 81
[2024-02-13T17:16:55.684Z]    at Microsoft.Azure.Functions.Worker.Extensions.DurableTask.FunctionsOrchestrator.RunAsync(TaskOrchestrationContext context, Object input) in /_/src/Worker.Extensions.DurableTask/FunctionsOrchestrator.cs:line 51
Stack:    at FunctionApp2.SayHelloOrchestration.RunOrchestrator(TaskOrchestrationContext context) in C:\Temp\FunctionApp2\SayHelloOrchestration.cs:line 20
[2024-02-13T17:16:55.684Z]    at FunctionApp2.SayHelloOrchestration.RunOrchestrator(TaskOrchestrationContext context) in C:\Temp\FunctionApp2\SayHelloOrchestration.cs:line 22
[2024-02-13T17:16:55.685Z]    at FunctionApp2.DirectFunctionExecutor.ExecuteAsync(FunctionContext context) in C:\Temp\FunctionApp2\Microsoft.Azure.Functions.Worker.Sdk.Generators\Microsoft.Azure.Functions.Worker.Sdk.Generators.FunctionExecutorGenerator\GeneratedFunctionExecutor.g.cs:line 44
[2024-02-13T17:16:55.686Z]    at Microsoft.Azure.Functions.Worker.OutputBindings.OutputBindingsMiddleware.Invoke(FunctionContext context, FunctionExecutionDelegate next) in D:\a\_work\1\s\src\DotNetWorker.Core\OutputBindings\OutputBindingsMiddleware.cs:line 13
[2024-02-13T17:16:55.686Z]    at Microsoft.Azure.Functions.Worker.Extensions.DurableTask.FunctionsOrchestrator.EnsureSynchronousExecution(FunctionContext functionContext, FunctionExecutionDelegate next, FunctionsOrchestrationContext orchestrationContext) in /_/src/Worker.Extensions.DurableTask/FunctionsOrchestrator.cs:line 81
[2024-02-13T17:16:55.687Z]    at Microsoft.Azure.Functions.Worker.Extensions.DurableTask.FunctionsOrchestrator.RunAsync(TaskOrchestrationContext context, Object input) in /_/src/Worker.Extensions.DurableTask/FunctionsOrchestrator.cs:line 51.
[2024-02-13T17:16:55.741Z] Executed 'Functions.SayHelloOrchestration' (Failed, Id=46f8e4d8-7614-49f2-b803-f20822509981, Duration=253ms)
[2024-02-13T17:16:55.742Z] System.Private.CoreLib: Exception while executing function: Functions.SayHelloOrchestration. Microsoft.Azure.WebJobs.Extensions.DurableTask: Non-Deterministic workflow detected: A previous execution of this orchestration scheduled a send event task with sequence ID 3, type 'EventSent' name 'op', instance ID '@resourcelockentity@00000000-0000-0000-0000-000000000000', but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running?.
[2024-02-13T17:16:55.747Z] 1f3590b6dbd24205a25b29cf5b497783: Function 'SayHelloOrchestration (Orchestrator)' failed with an error. Reason: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.SayHelloOrchestration
[2024-02-13T17:16:55.749Z]  ---> DurableTask.Core.Exceptions.OrchestrationFailureException: Non-Deterministic workflow detected: A previous execution of this orchestration scheduled a send event task with sequence ID 3, type 'EventSent' name 'op', instance ID '@resourcelockentity@00000000-0000-0000-0000-000000000000', but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running?
[2024-02-13T17:16:55.751Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.RemoteOrchestratorContext.ThrowIfFailed() in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\RemoteOrchestratorContext.cs:line 58
[2024-02-13T17:16:55.751Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcMiddleware.<>c__DisplayClass10_0.<<CallOrchestratorAsync>b__0>d.MoveNext() in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\OutOfProcMiddleware.cs:line 145
[2024-02-13T17:16:55.752Z] --- End of stack trace from previous location ---
[2024-02-13T17:16:55.753Z]    at Microsoft.Azure.WebJobs.Host.Executors.TriggeredFunctionExecutor`1.<>c__DisplayClass7_0.<<TryExecuteAsync>b__0>d.MoveNext() in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\TriggeredFunctionExecutor.cs:line 50
[2024-02-13T17:16:55.754Z] --- End of stack trace from previous location ---
[2024-02-13T17:16:55.754Z]    at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 581
[2024-02-13T17:16:55.755Z]    at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 527
[2024-02-13T17:16:55.756Z]    at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
[2024-02-13T17:16:55.757Z]    --- End of inner exception stack trace ---
[2024-02-13T17:16:55.757Z]    at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
[2024-02-13T17:16:55.758Z]    at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108. IsReplay: False. State: Failed. RuntimeStatus: Failed. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.13.1. SequenceNumber: 12. TaskEventId: -1

App Details

Project:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <AzureFunctionsVersion>v4</AzureFunctionsVersion>
    <OutputType>Exe</OutputType>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <DockerFastModeProjectMountDirectory>/home/site/wwwroot</DockerFastModeProjectMountDirectory>
    <DockerDefaultTargetOS>Linux</DockerDefaultTargetOS>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.21.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.1.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.0" />
    <PackageReference Include="Microsoft.ApplicationInsights.WorkerService" Version="2.22.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="1.2.0" />
    <PackageReference Include="Microsoft.VisualStudio.Azure.Containers.Tools.Targets" Version="1.19.6" />
  </ItemGroup>
  <ItemGroup>
    <None Update="host.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
    <None Update="local.settings.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <CopyToPublishDirectory>Never</CopyToPublishDirectory>
    </None>
  </ItemGroup>
  <ItemGroup>
    <Using Include="System.Threading.ExecutionContext" Alias="ExecutionContext" />
  </ItemGroup>
</Project>

settings:

{
    "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "dotnet-isolated"
  }
}

Screenshots

Lock before disposing: (screenshot)
Entity unlocked after custom callout in dispose: (screenshot)

If deployed to Azure

This is a simplified excerpt of our production function. For production details reach out to me directly to get a repro.

DEndersby91 commented 9 months ago

Amazing bug report @ranasch, I've been facing the exact same issue and had just made a stripped-down example of my own when I saw this as I came to post.

So it's not a complete waste of time, here is the basic example I came up with: Functions v4, all the latest relevant packages, .NET 8 isolated, tested locally with Azurite.

using System;
using System.Threading.Tasks;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Microsoft.DurableTask.Entities;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;

public class EntityLockExample
{
    [Function(nameof(EntityLockExample.EntityLockExampleTrigger))]
    public async Task EntityLockExampleTrigger(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = $"{nameof(EntityLockExample.EntityLockExampleTrigger)}/{{throwError}}")] HttpRequestData requestData,
        bool? throwError,
        [DurableClient] DurableTaskClient durableTaskClient)
    {
        await durableTaskClient.ScheduleNewOrchestrationInstanceAsync(new TaskName(nameof(this.EntityLockExampleOrchestration)), throwError.GetValueOrDefault(false));
    }

    [Function(nameof(EntityLockExample.EntityLockExampleOrchestration))]
    public async Task EntityLockExampleOrchestration([OrchestrationTrigger] TaskOrchestrationContext orchestrationContext, bool throwError)
    {
        EntityInstanceId entityLock = new EntityInstanceId(nameof(EntityLockExample), "myLock");
        await using (await orchestrationContext.Entities.LockEntitiesAsync(entityLock)) // lock should be released when this scope is disposed, even if an exception escapes
        {
            await orchestrationContext.CallActivityAsync(new TaskName(nameof(this.EntityLockExampleActivity)), throwError);
        }
    }

    [Function(nameof(EntityLockExample.EntityLockExampleActivity))]
    public void EntityLockExampleActivity([ActivityTrigger] bool throwError)
    {
        if (throwError)
        {
            throw new Exception("Activity Failed");
        }
    }
}

With this code you can run it as many times as you want passing throwError = false. But as soon as you pass throwError = true once, the lock is never released and all subsequent requests are held forever. The lock does know that more and more tasks are queued up on it though, so that part is getting updated.

I also tried to capture any exceptions inside the critical section and rethrow them outside the using block, after the lock had been disposed, but sadly this did not help.
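
Roughly, that attempt looked like this (a simplified sketch reusing the names from my example above, not the exact code I ran):

Exception? captured = null;
await using (await orchestrationContext.Entities.LockEntitiesAsync(entityLock))
{
    try
    {
        await orchestrationContext.CallActivityAsync(new TaskName(nameof(this.EntityLockExampleActivity)), throwError);
    }
    catch (Exception ex)
    {
        captured = ex; // don't let the exception escape the critical section
    }
}

// rethrow only after the await using scope has been disposed, i.e. after the
// lock release should already have been scheduled -- this still did not help
if (captured is not null)
{
    throw captured;
}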

Another "workaround" I have at the moment is to never allow an orchestration that uses critical sections to enter a failed state. To do this I capture any exception in the activities and store that in a custom state. This is less than ideal though as using the DFMonitor it looks like everything is running fine when really there may be many issues.

I have read some other tickets related to this issue where people were asking about the Netherite storage engine, which I hoped would solve this, but sadly I can't get it to work with .NET 8 isolated. Might be uncouth, but here is a link to that issue in the Netherite repo.

Thanks @sebastianburckhardt for already assigning this to yourself; if you have any other ideas or workarounds, that would be ace.

sebastianburckhardt commented 9 months ago

I have been able to confirm that the current implementation does not properly release the lock when an exception is thrown inside a critical section. I have not yet been able to determine why though.

sebastianburckhardt commented 9 months ago

The problem seems to be with how OutOfProcMiddleware.CallOrchestratorAsync is implemented. Specifically, when an orchestrator function fails, the OrchestratorExecutionResult returned by the remote worker is ignored, and instead an alternate OrchestratorExecutionResult.ForFailure is constructed from the exception returned by the function result. This is not a good idea because (a) the original failure details are lost, which contain the actual stack trace of the exception in user code, and (b) any orchestrator actions that took place prior to the completion (e.g. sending lock release messages) are ignored.

I think the solution is: if the function invocation fails, first try to use the original execution result (when that result is itself a failure), and only construct a new failure result if the original execution result is not a failure.
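
In pseudocode, the intended selection logic is roughly the following (illustrative only; apart from OrchestratorExecutionResult.ForFailure, the member and variable names are placeholders, not the actual OutOfProcMiddleware code):

// Proposed handling in CallOrchestratorAsync when the function invocation fails:
OrchestratorExecutionResult effectiveResult;
if (workerExecutionResult != null && workerExecutionResult.IsFailure)
{
    // keep the worker's own failure result, so the original failure details and any
    // already-scheduled actions (e.g. lock release messages) are preserved
    effectiveResult = workerExecutionResult;
}
else
{
    // only synthesize a failure when the worker did not report one itself
    effectiveResult = OrchestratorExecutionResult.ForFailure(failureMessage, functionException);
}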

sebastianburckhardt commented 9 months ago

PR #2748 contains a fix for this issue and (barring any complications) can go out with the next release.

DEndersby91 commented 6 months ago

I can confirm with the latest release on NuGet this issue is resolved. Thanks @sebastianburckhardt for working on it.

Correction: sadly this issue is not resolved, I must have tested it incorrectly... My original "EntityLockExample" code can be used to demonstrate this with 1.1.2. Trying to test now with 1.1.3, and my code can't seem to even take a lock out, even after fully resetting Azurite just in case something was stuck in a queue.

ranasch commented 6 months ago

@sebastianburckhardt @DEndersby91 Not sure how you confirmed it. When I upgraded Microsoft.Azure.Functions.Worker.Extensions.DurableTask from 1.1.1 to 1.1.3 in my sample above, I don't get a lock anymore. This call never returns:

            var entityHandle = await context.Entities.LockEntitiesAsync(entityId);

What am I missing?

DEndersby91 commented 6 months ago

You've made me worried now. I tested the code I linked in my original comment and it no longer locks up for me after an error. Let me add some delays inside the activity and queue a few up and see if the lock is taken out.

Are you testing with an existing hub that already has a locked up entity? I don't think this fix will correct things historically so some cleaning up might be required.
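
If an entity is already stuck with an orphaned lock from one of those earlier failed runs, the entity storage cleanup API on the durable client should be able to release it. A rough sketch (availability and exact signature depend on your Microsoft.DurableTask.Client version, so treat this as a pointer rather than tested code):

[Function("CleanUpEntityStorage")]
public static async Task CleanUpEntityStorage(
    [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData request,
    [DurableClient] DurableTaskClient client)
{
    // With its default options this is meant to remove empty entities and release
    // orphaned locks; verify the behavior against the client version you are running.
    await client.Entities.CleanEntityStorageAsync();
}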

ranasch commented 6 months ago

> You've made me worried now. I tested the code I linked in my original comment and it no longer locks up for me after an error. Let me add some delays inside the activity and queue a few up and see if the lock is taken out.
>
> Are you testing with an existing hub that already has a locked up entity? I don't think this fix will correct things historically so some cleaning up might be required.

No - testing locally with Azurite and tried with existing hub as well as discarding/ creating a new hub from scratch. Same result

DEndersby91 commented 6 months ago

You're 100% right @ranasch, I don't know what I was testing before, but this issue is sadly not resolved with the current 1.1.2 release. Trying to test now with 1.1.3, I can't seem to take a lock out even when using a fresh Azurite instance.

OK, so reading other issues, this might be caused by entities not getting picked up: https://github.com/Azure/azure-functions-durable-extension/issues/2830

ranasch commented 6 months ago

1.1.4 seems to fix this for me

roend83 commented 1 month ago

This is happening to me in 1.1.4. I'm occasionally getting errors like this:

Exception while executing function: Functions.KccConsumerOrchestrationFunction Did not find any initialized language workers

When this happens, the orchestration fails, and entity locks are not released without manual intervention to restart the orchestration.