Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.

RetryOptions error-handler in Orchestration cannot capture right exception-type from Sub-Orchestration #926

Open Nabakamal opened 11 months ago

Nabakamal commented 11 months ago

My question is related to #436 and #807, as explained below:

Problem: The exception handler of RetryOptions (Handle) never receives the original exception, either via InnerException or via FailureDetails.

Scenario: I have an Orchestration that calls a sub-orchestration using RetryOptions, like:

RetryOptions retryOptions = new RetryOptions(TimeSpan.FromMilliseconds(2500), 3)
{
    Handle = e =>
    {
        // The exception passed in here never carries MyCustomException (the exception
        // originally thrown), either in FailureDetails or in the InnerException property.
        SubOrchestrationFailedException tfe = e as SubOrchestrationFailedException;

        if (tfe != null && tfe.InnerException != null)
        {
            e = tfe.InnerException;
        }

        MyCustomException ce = e as MyCustomException;
        if (ce != null)
        {
            LatestException = ce; // LatestException is a variable of type MyCustomException
            return true;
        }
        return false;
    }
};

// IPayload is a custom type that my sub-orchestration is supposed to return (if it runs successfully)
return await context.CreateSubOrchestrationInstanceWithRetry<IPayload>(typeof(FetchRatesSubOrchestration), retryOptions, 1);

The sub-orchestration's RunTask() looks like:

public override async Task<IPayload> RunTask(OrchestrationContext context, object input)
{
    List<AssetDTO> rates = new List<AssetDTO>();
    Reference1 reference1 = await context.ScheduleTask<Reference1>(typeof(TaskActivity1));
    await context.ScheduleTask<bool>(typeof(TaskActivity2), reference1);
    rates = await context.ScheduleTask<List<AssetDTO>>(typeof(TaskActivity3), reference1);
    List<string> datasReceived = rates.Select(x => x.TickerName).ToList();
    List<string> validDataPoints = _dbContext.SourceKeys.Select(t => t.SourceKeyValue).ToList();
    List<string> missingDataPoints = datasReceived.Except(validDataPoints).ToList();
    if (missingDataPoints.Count > 0)
    {
        // Build the message once so the log entry and the exception stay in sync.
        string message = $"Requested data points {string.Join(",", missingDataPoints)} were not returned in the response. Retrying.";
        _logger.LogError(message);
        throw new MyCustomException(message);

        //OrchestrationFailureException innerExceptionToThrow = new OrchestrationFailureException($"Requested data points {string.Join(",", missingDataPoints)} were not returned in the response. Retrying.");
        //throw innerExceptionToThrow;
        // var exc =  new DurableTask.Core.Exceptions.SubOrchestrationFailedException($"Orchestration failed - {GetType().Name}", innerExceptionToThrow);
        // exc.FailureDetails
        // throw exc;
    }
    return new RatesAvailablePayload(rates);
}

The exception thrown within the if-block in the sub-orchestration never bubbles up to the Handle callback of the RetryOptions defined in the parent orchestration.
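To rule out the handler logic itself, the unwrapping step can be exercised outside the framework. The following is a minimal sketch, not part of the original report: `WrapperException` is a hypothetical stand-in for SubOrchestrationFailedException, `MyCustomException` is redeclared here only so the snippet compiles on its own, and `HandlerLogic.FindCause` is an illustrative helper name.

```csharp
using System;

// Hypothetical stand-in for the custom exception thrown by the sub-orchestration.
public class MyCustomException : Exception
{
    public MyCustomException(string message) : base(message) { }
}

// Hypothetical stand-in for SubOrchestrationFailedException, so that the
// sketch compiles without the DurableTask packages.
public class WrapperException : Exception
{
    public WrapperException(string message, Exception inner) : base(message, inner) { }
}

public static class HandlerLogic
{
    // Walks the InnerException chain and returns the first exception of type T,
    // or null when no exception of that type is attached.
    public static T FindCause<T>(Exception e) where T : Exception
    {
        for (Exception current = e; current != null; current = current.InnerException)
        {
            if (current is T match)
            {
                return match;
            }
        }
        return null;
    }
}
```

Inside Handle this would be used as `LatestException = HandlerLogic.FindCause<MyCustomException>(e); return LatestException != null;`. If the helper finds the cause when given a hand-built wrapper but returns null in the real handler, the original exception object is not attached to what the handler receives, which is consistent with the behavior described above.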

What else I have tried:

a. Setting the ErrorPropagationMode to ErrorPropagationMode.UseFailureDetails (or ErrorPropagationMode.SerializeExceptions) doesn't help: the FailureDetails object of the exception and the InnerException property are always null.

b. I have tried debugging against the framework source, and I believe I saw the "Multiple ExecutionCompletedEvent found, potential corruption in state storage" message from the SetMarkerEvents() method in the DurableTask.Core.OrchestrationRuntimeState class, but only during one of my debugging sessions.

c. While debugging the framework source, I noticed that the FailureDetails object is populated for the most part. However, because the code executes in parallel, control jumps from one class to another and then to a third, which makes following a trail difficult at best.

d. I have looked at most, if not all, of the tests in the DurableTask and DurableTask-MSSQL repositories, but they haven't helped me figure out what I may still be missing.

I'm using the following packages:

Microsoft.DurableTask.SqlServer - 1.1.1
Microsoft.Azure.DurableTask.Core - 2.10.0 (dependency brought in via Microsoft.DurableTask.SqlServer)

Please advise on how to get this resolved.

Thank you.

Nabakamal commented 11 months ago

@cgillum @jviau @papigers Can you please advise, when time permits? Thank you.