Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Orchestration stuck in Running state #2364

Open · thomasrosdahl opened this issue 1 year ago

thomasrosdahl commented 1 year ago

Description

Orchestration stuck in Running state even though execution completed successfully according to History table.

Expected behavior

Orchestration should transition to Completed state after successful execution.

Actual behavior

Orchestration is stuck in the Running state and there is no way to recover without manually deleting the instance from the "Instances" table. TerminateAsync and RestartAsync do not work.

Relevant source code snippets

Known workarounds

Manually deleting the instance record from the "Instances" table.
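A programmatic version of the same cleanup is sketched below, assuming the Durable client's purge API can remove the stuck record. The function name, route, and auth level are illustrative, not part of our app:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class PurgeStuckInstance
{
    // Purges the stuck instance's record and history rows via the client
    // API instead of hand-editing the "Instances" table.
    [FunctionName("PurgeStuckInstance")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "delete", Route = "instances/{instanceId}")] HttpRequest req,
        string instanceId,
        [DurableClient] IDurableOrchestrationClient client)
    {
        PurgeHistoryResult result = await client.PurgeInstanceHistoryAsync(instanceId);
        return new OkObjectResult($"Purged {result.InstancesDeleted} instance record(s).");
    }
}
```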

App Details

Screenshots

*(Two screenshots attached; one shows the History table for the affected instance.)*

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

If you don't want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it's extremely difficult to look up information.

davidmrdavid commented 1 year ago

Hi @thomasrosdahl, thanks for reaching out.

We'll need a bit more information to debug this. One question that comes to mind: does your affected orchestrator have any pending sub-tasks (sub-orchestrations or Activities) that have not yet completed by the time the orchestrator reaches its return statement? If so, that would explain why you see the orchestrator remain "Running": an orchestration only reaches the "Completed" state once all of its sub-tasks have completed as well.
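For illustration, here's a minimal sketch of the kind of fire-and-forget pattern that can cause this (the function names are hypothetical):

```csharp
[FunctionName("FireAndForgetOrchestrator")]
public static async Task RunOrchestrator(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    // Scheduled but never awaited: this pending sub-task keeps the
    // instance in the Running state even after the orchestrator returns.
    Task pending = context.CallActivityAsync("LongRunningActivity", null);

    // Awaited normally, so it completes before the return statement.
    await context.CallActivityAsync("QuickActivity", null);

    // Execution reaches the end here, but the instance only transitions
    // to Completed once "LongRunningActivity" has finished as well.
}
```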

Additionally, does this occur locally or only on Azure? If it's only on Azure, could you please provide us with your orchestrator's instanceID? Thanks!

thomasrosdahl commented 1 year ago

Hi @davidmrdavid,

The orchestration has one activity: CreateTenantDashboardDataSet. Looking at the attached screenshot of the History table, it looks like the activity completed successfully as well (row 6)? We've only observed this in Azure, and it's not very frequent. However, when it does happen it requires manual intervention.

Here's the code for our orchestrator function:

```csharp
[FunctionName(nameof(BuildTenantDashboard))]
[Disable("DisableDashboardBuilder")]
public async Task BuildTenantDashboard(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    var retryOptions = new RetryOptions(TimeSpan.FromMinutes(1), 10)
    {
        BackoffCoefficient = 2,
        MaxRetryInterval = TimeSpan.FromMinutes(10)
    };

    var tenantId = context.GetInput<string>();
    await context.CallActivityWithRetryAsync(nameof(CreateTenantDashboardDataSet), retryOptions, tenantId);
}

[FunctionName(nameof(CreateTenantDashboardDataSet))]
[Disable("DisableDashboardBuilder")]
public async Task CreateTenantDashboardDataSet(
    [ActivityTrigger] string tenantId,
    [Table("DashboardData", Connection = "StorageConnection")] CloudTable table,
    [Blob("operations", Connection = "StorageConnection")] CloudBlobContainer blobContainer)
{
    await _dashboardBuilderService.BuildAsync(tenantId, table, blobContainer);
}
```

Do you have an email where I can send the orchestration ID?

Thanks!

davidmrdavid commented 1 year ago

@thomasrosdahl - you can reach me at . Please ping me here once you've emailed me so that I may redact my email :P

thomasrosdahl commented 1 year ago

@davidmrdavid, you've got mail sir!

davidmrdavid commented 1 year ago

hey @thomasrosdahl, I'm just posting here for visibility that we've been discussing this issue directly via email. Did you get a chance to consider Netherite or MSSQL as alternative backends to circumvent this issue?
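For reference, switching backends is mostly a host.json change plus installing the corresponding storage provider package. A rough sketch for Netherite follows; the connection names and partition count shown are illustrative defaults, and the exact keys depend on the provider package version:

```json
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "Netherite",
        "partitionCount": 12,
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}
```

The MSSQL provider is configured analogously, with "type": "mssql" and a "connectionStringName" pointing at the database connection string.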

thomasrosdahl commented 1 year ago

@davidmrdavid Not yet. It would introduce additional moving parts for us which we'd prefer to avoid if possible. Any ETA on the fix for the Azure Storage backend?

Thanks!

davidmrdavid commented 1 year ago

The root problem will take time to fix, but we're discussing a few tactical fixes that could be executed faster. I can't provide a concrete ETA just yet, but I plan to link a PR here once we have a prototype fix.

I'll aim to keep this thread posted as updates come in.