Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License
713 stars 267 forks source link

Root cause exception is not exposed in durable function orchestrator #2476

Open torepaulsson opened 1 year ago

torepaulsson commented 1 year ago

Description

This might be more of an oversight or missing for some other reason, anyways here it is:

Throwing an exception in an activity function causes a DurableTask.Core.Exceptions.TaskFailedException exception. I cannot find a way to extract the root exception from this exception type. Is it supposed to be there? All I get is a message and a type, but the type is an AggregateException.

Also, the documentation here https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-error-handling?tabs=csharp-isolated#errors-in-activity-functions states that it should be a FunctionFailedException exception. When I used inprocess durable functions this exception was thrown and it was possible to extract the root exception. In isolated the documentation is wrong.

Expected behavior

The exception thrown contains the root exception

Actual behavior

The exception thrown containsonly the message of the root exception

App Details

cgillum commented 1 year ago

@torepaulsson does this issue occur if you upgrade to the latest version of Microsoft.Azure.Functions.Worker.Extensions.DurableTask (v1.0.2)?

FYI @jviau

jviau commented 1 year ago

I don't recall 1.0.2 having any fixes related to this (but I may be wrong). I think what @torepaulsson is touching on is that in dotnet-isolated we switched to a failure-details only mode, and explicitly do not attempt to serializing/deserialize the original exception across the boundary. We do have a TODO to populate nested exceptions, but I think I want to look into this experience a bit more, I am not entirely sold it was the right call to remove exception propagation across the serialization boundary.

As for AggregateException - are you calling any Task.Result or Task.Wait APIs? Those will cause any exception to be surfaced as AggregateException.

cgillum commented 1 year ago

I'm curious to understand this better:

Throwing an exception in an activity function causes a DurableTask.Core.Exceptions.TaskFailedException exception.

There shouldn't be any DurableTask.Core exceptions surfaced to your code. If you're seeing this, then it's probably something that needs to be fixed. I've seen similar reports from Dapr users as well, which makes me think our abstraction may still be a bit leaky in some places.

torepaulsson commented 1 year ago

@jviau I believe it has to do with the serialization as you say, and if you remember that it is removed then that is probably it!

@cgillum, DurableTask.Core comes as an inner exception from the aggregate exception. image

[Function(nameof(OrchestrateAsync))]
public async Task OrchestrateAsync([OrchestrationTrigger] TaskOrchestrationContext context)
{
  var replaySafeLogger = context.CreateReplaySafeLogger<Orchestrator>();

  try
  {
    await context.CallActivityAsync(nameof(ThrowExceptionActivityAsync), context.InstanceId);
  }
  catch (Exception e)
  {
    replaySafeLogger.LogError(e, "Orchestrator failed");
    throw;  
  }
}

[Function(nameof(ThrowExceptionActivityAsync))]
public Task ThrowExceptionActivityAsync([ActivityTrigger] string instanceId)
{
  throw new InvalidOperationException("Just a test!");
}

@jviau, in the above code i get an AggregateException but I did not use Task.Result or Task.Wait anywhere here, maybe there is a call to that method inside the code that handles the activities? In the CallActivityAsync maybe?

Packages used

<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.14.1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.10.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.0.13" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.0.2" />
<PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.9.5" />
<PackageReference Include="Microsoft.Extensions.DependencyInjection" Version="7.0.0" />
jviau commented 1 year ago

For AggregateException, I believe this is the cause: https://github.com/Azure/azure-functions-dotnet-worker/issues/993.

Will need to see if we can even work around this, given it is an issue with the dotnet worker and not durable.

wwalkley commented 9 months ago

After migrating to the Isolated worker process, we've encountered the same issue with handling exceptions. The FunctionFailedException is no longer accessible in the latest worker packages. Despite extensive searching, there's no clear documentation or online resources addressing this issue.

The documentation states that exceptions thrown in activity functions should be marshaled back to the orchestrator function as FunctionFailedException. However, in our case, they are propagating up as TaskFailedException. This behavior differs from the expected outcome, as outlined in this open ticket.

We kindly request an update on the progress towards addressing this issue. The inability to utilize FunctionFailedException is hindering our ability to effectively manage exceptions in the Isolated worker process. A prompt resolution would be greatly appreciated.

Example of TaskFailedException surfacing image

Code changes we had to make when moving to an Isolated worker. image

irhad-ljubcic commented 9 months ago

Any updates on this ?

eholman commented 6 months ago

I'm having an isolated/out-of-process project and it's still an issue. When an exception is caught in the Activity itself then you have to/can handle it yourself (e.g. log it)

Also. the RetryPolicy like the docs are mentioning aren't working either?

    <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.21.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask.SqlServer" Version="1.2.2" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.OpenApi" Version="1.5.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.17.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Sql" Version="3.0.534" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.2" />
cgillum commented 6 months ago

@eholman @irhad-ljubcic various fixes have been made over the last few months, including a couple in the last week or so which will hopefully get released soon:

Also. the RetryPolicy like the docs are mentioning aren't working either?

In my testing, custom retry handlers have been working fine, so we'll need more details about what specifically isn't working for you.

eholman commented 6 months ago

This is the policy I tried according to the docs:

public static TaskOptions TryPolicy
{
    get
    {
        TaskOptions retryOptions = TaskOptions.FromRetryHandler(retryContext =>
        {
            // Don't retry anything that derives from ApplicationException
            if (retryContext.LastFailure.IsCausedBy<ApplicationException>())
            {
                return false;
            }

            // Quit after N attempts
            return retryContext.LastAttemptNumber < 3;
        });

        return retryOptions;
    }
}

I have a FluentValidator in the MediatR pipeline which throws an ValidationException inside my activity:

[Function(nameof(ProcessImportData))]
public static async Task<IEnumerable<Guid>> ProcessImportData(
  [ActivityTrigger] OrchestrationData<Metadata.Import, List<Guid>> orchestrationData, FunctionContext executionContext)
{
  var sender = executionContext.InstanceServices.GetRequiredService<ISender>();

  var orderIds = await sender.Send(new ProcessImportDataCommand(orchestrationData.Metadata.ImportData));
  return orderIds;
}

image

WIth this setup, the retry policy kicks in but I expect at least some info regarding the ValidationException or am I wrong here? image

Within the orchestrationTrigger method I'm triggering the activity like:

await context.CallActivityAsync<IEnumerable<Guid>>(nameof(Activities.Import.ProcessImportData), orchestrationData,
    RetryPolicies.Import.TryPolicy)

I've just tried with the pre-release version 1.2.0-rc.1 of Microsoft.Azure.Functions.Worker.Extensions.DurableTask

cgillum commented 6 months ago

@eholman Okay, so it sounds like the retry handler is getting triggered, so retry handlers per-se are not the problem. Rather, the problem is that the exception message you're getting from LastFailure is not what you expected. I wanted to clarify this because the exception information is not specific to retry handlers. It's the same information you'd get if you caught a TaskFailedException in regular orchestrator code, which I assume would be having the exact same problem.

Can you share the contents of your .csproj? I recall there was a .NET Isolated worker bug where ordinary exceptions were being misinterpreted as AggregateException, which I assume is the problem you're seeing. This happens outside of the Durable Functions extension - we just surface what's given to us by the .NET Isolated worker. There was a fix and I'd like to understand if you're using the latest .NET Isolated worker SDKs.

Regardless, I suspect there's an issue where Durable isn't capturing inner exceptions. This is something we can continue to look into.

eholman commented 5 months ago

Thanks @cgillum; apologies for the late response!

Yes, that's correct; the retry handler isn't the issue per se, I understand that it would be the same information outside of it. It was an use case where it is very useful to know the underlying exception, at least with respect to a ValidationException which doesn't need to be retried.

Here is the csproj:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net6.0</TargetFramework>
    <AzureFunctionsVersion>v4</AzureFunctionsVersion>
    <OutputType>Exe</OutputType>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <LangVersion>preview</LangVersion>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="JsonSchema.Net.Generation" Version="4.1.1" />
    <PackageReference Include="Microsoft.ApplicationInsights" Version="2.22.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.21.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.2.0-rc.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask.SqlServer" Version="1.2.2" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.OpenApi" Version="1.5.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.17.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Sql" Version="3.0.534" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.2" />
    <PackageReference Include="Microsoft.EntityFrameworkCore.Design" Version="7.0.11">
      <PrivateAssets>all</PrivateAssets>
      <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
    </PackageReference>
    <PackageReference Include="Serilog" Version="3.1.1" />
    <PackageReference Include="Serilog.Enrichers.CorrelationId" Version="3.0.1" />
    <PackageReference Include="Serilog.Exceptions" Version="8.4.0" />
    <PackageReference Include="Serilog.Extensions.Logging" Version="8.0.0" />
    <PackageReference Include="Serilog.Sinks.ApplicationInsights" Version="4.0.0" />
    <PackageReference Include="Serilog.Sinks.Console" Version="5.0.1" />
    <PackageReference Include="Serilog.Sinks.Http" Version="8.0.0" />
  </ItemGroup>
  <ItemGroup>
    <ProjectReference Include="..\xx.Infrastructure\xx.Infrastructure.csproj" />
  </ItemGroup>
  <ItemGroup>
    <None Update="host.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
    <None Update="local.settings.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <CopyToPublishDirectory>Never</CopyToPublishDirectory>
    </None>
  </ItemGroup>
  <ItemGroup>
    <Using Include="System.Threading.ExecutionContext" Alias="ExecutionContext" />
  </ItemGroup>
</Project>
cgillum commented 5 months ago

I'm working on some fixes which will address the missing inner exception problem. Stay tuned for more updates.

clrockwell commented 2 months ago

I'm working on some fixes which will address the missing inner exception problem. Stay tuned for more updates.

Were there any updates on this? There seem to be a few issues referencing similar problems. My project is using <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.4" /> and I have enabled EnableUserCodeException in my HostBuilder.

services.Configure<WorkerOptions>(options =>
        {
            options.EnableUserCodeException = true;
        });

However, I'm still not seeing the details of my custom exception that is thrown from the sub orchestrator: image

My exception is this:

public class SO_QueueQuestionsForLearnerQueueItemsException : Exception
    {
        public SO_QueueQuestionsForLearnerQueueItemsFailureStep FailureStep { get; set; }

        public LearnerQuestionQueueItem QueueItem { get; set; }

        public SO_QueueQuestionsForLearnerQueueItemsException()
        {
        }

        public SO_QueueQuestionsForLearnerQueueItemsException(string message, SO_QueueQuestionsForLearnerQueueItemsFailureStep failureStep, LearnerQuestionQueueItem item)
            : base(message)
        {
            FailureStep = failureStep;
            QueueItem = item;
        }

        public SO_QueueQuestionsForLearnerQueueItemsException(string message, Exception inner, SO_QueueQuestionsForLearnerQueueItemsFailureStep failureStep, LearnerQuestionQueueItem item)
            : base(message, inner)
        {
            FailureStep = failureStep;
            QueueItem = item;
        }
    }
TimoWilhelm commented 1 month ago

We're also seeing the same behavior of our Durable Functions Activity Functions only reporting AggregateExceptions in the FailueDetails so we are unable to identify them in the Orchestrator.

We've confirmed that the Activity itself throws the correct Exception but the root cause of the AggregateExceptions seems to be the internal .Result call in the ContinueWith section of the DefaultFunctionInvoker in dotnet worker here:

https://github.com/Azure/azure-functions-dotnet-worker/blob/main/src/DotNetWorker.Core/Invocation/DefaultFunctionInvoker.cs#L32

I would have expected the generated DirectFunctionExecutor to skip the DefaultFunctionExecutor and just execute the Activity Function but that doesn't seem to be the case.

image

@cgillum I see there were some changes to the DefaultFunctionExecutor class related to Durable Functions but it has been quite some time ago. Could this be related?