david-pw opened 3 months ago
For anyone who comes across this issue: as an interim countermeasure we've set up a timer trigger that monitors the status of the orchestration. It's configured to run every 15 minutes, which is half the host function timeout ("functionTimeout": "00:30:00").
If the orchestration instance is in a non-running state, it simply restarts the orchestration for us. We also switched the orchestration pattern to a singleton (fixed instance ID) to ensure that no more than one orchestration process is running at any given time. It's been running for a day now and has auto-recovered twice.
Timer Function
[Function(nameof(Monitor_Orchestrator_SyncLogDataToDataLake))]
public async Task Monitor_Orchestrator_SyncLogDataToDataLake(
    [TimerTrigger("0 */15 * * * *")] TimerInfo timerInfo, [DurableClient] DurableTaskClient client,
    FunctionContext executionContext)
{
    var logger = executionContext.GetLogger(nameof(Monitor_Orchestrator_SyncLogDataToDataLake));

    // If a new instance has to be started, resume syncing from 6 hours ago.
    var lastRunUtcIfNew = DateTime.UtcNow.AddHours(-6);

    await OrchestrationUtilities.ScheduleOrchestratorIfNotRunning(client,
        nameof(Orchestrator_SyncLogDataToDataLake), SingletonInstanceId, lastRunUtcIfNew, logger);
}
Helper
public static async Task<string> ScheduleOrchestratorIfNotRunning(DurableTaskClient client,
    TaskName orchestratorName, string instanceId, object? input = null, ILogger? logger = null,
    CancellationToken cancellationToken = default)
{
    var existingInstance = await client.GetInstanceAsync(instanceId, cancellation: cancellationToken);

    if (existingInstance is not
        {
            RuntimeStatus: not OrchestrationRuntimeStatus.Completed
                       and not OrchestrationRuntimeStatus.Failed
                       and not OrchestrationRuntimeStatus.Terminated
        })
    {
        // An instance with the specified ID doesn't exist, or an existing one has stopped running; create one.
        instanceId = await client.ScheduleNewOrchestrationInstanceAsync(orchestratorName, input,
            new StartOrchestrationOptions(instanceId), cancellationToken);
        logger?.LogInformation("Started orchestration with ID = '{instanceId}'.", instanceId);
    }
    else
    {
        // An instance is still Pending/Running/Suspended; leave it alone and report its status.
        instanceId = existingInstance.InstanceId;
        logger?.LogInformation("Orchestration '{instance}' status {status}.", instanceId,
            existingInstance.RuntimeStatus.ToString());
    }

    return instanceId;
}
Thanks for raising the issue and posting the workaround, David! We'll take a look on our end as well.
FYI @cgillum
Description
We've been running an eternal orchestration pattern in Australia East for the last couple of weeks. We noticed that occasionally the orchestration would never reach or execute context.ContinueAsNew when we had "functionTimeout": -1 set. To help diagnose the issue, we set "functionTimeout": "00:30:00", suspecting a runaway activity; however, I was surprised to see that the OrchestrationTrigger itself was timing out. Typically this orchestration runs for about 10-15 minutes, and it executes eternally for 5-8+ hours before hanging like this. We're unsure how to diagnose or address this issue further.
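For context, the orchestration is shaped like a typical eternal orchestration in the .NET isolated worker. The following is a minimal sketch rather than our production code: the activity call, the input, and the delay between iterations are placeholders, and only the orchestrator name matches the real function.
// usings assumed: Microsoft.Azure.Functions.Worker, Microsoft.DurableTask, System.Threading
[Function(nameof(Orchestrator_SyncLogDataToDataLake))]
public static async Task Orchestrator_SyncLogDataToDataLake(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    // The "sync from" timestamp supplied when the instance was scheduled, or by the previous ContinueAsNew.
    var lastRunUtc = context.GetInput<DateTime>();

    // Placeholder for the real work; the production orchestration calls several activities here.
    await context.CallActivityAsync("Activity_SyncLogData", lastRunUtc);

    // Wait before the next iteration, then restart the orchestration with a fresh history.
    // This ContinueAsNew step is the one that never appears to execute once the trigger times out.
    await context.CreateTimer(context.CurrentUtcDateTime.AddMinutes(15), CancellationToken.None);
    context.ContinueAsNew(context.CurrentUtcDateTime);
}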
Interestingly, when we restarted the Flex Consumption Plan app while "functionTimeout": -1 was set, the orchestration would continue as expected.
Expected behavior
I expect that the OrchestrationTrigger shouldn't time out, as the timeout sidesteps any recovery logic we have in place to self-heal the process.
Actual behavior
All activities run to completion; however, when attempting to continue using context.ContinueAsNew, the orchestration trigger times out.
Relevant source code snippets
host.json
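The full file isn't reproduced here; a minimal host.json showing the relevant setting would look roughly like the following. "00:30:00" is the 30-minute timeout mentioned above, and we previously had "functionTimeout": -1 (unlimited) in its place.
{
  "version": "2.0",
  "functionTimeout": "00:30:00"
}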
Known workarounds
Setting "functionTimeout": -1 and restarting the Flex Consumption Plan app whenever we detect that the process has frozen; not really a workaround.
App Details
Note: version ranges reflect updates we've made day to day.
"Microsoft.Extensions.Azure": "1.7.4 - 1.7.5"
"Microsoft.Azure.Functions.Worker": "1.23.0"
"Microsoft.Azure.Functions.Worker.Sdk": "1.17.4"
"Microsoft.Azure.Functions.Worker.Extensions.DurableTask": "1.1.4 - 1.1.5"
.NET 8 Isolated
C#
Linux Flex Consumption Plan (Preview)
Screenshots
Instance History: Unsure if related; however, I have consistently seen the sentinel RowKey right before the OrchestratorCompleted event when the process has frozen like this...
If deployed to Azure
2024-08-13T15:47:23.7057963Z - 2024-08-13T16:47:23.7057963Z
pw-prod-dataintegration
Orchestrator_SyncLogDataToDataLake
Australia East
71fff80ba19241a69235a8958082cf08
can be provided privately