Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Eternal Function does not restart #1055

Closed: ghost closed this issue 4 years ago

ghost commented 4 years ago

Description

I have an eternal function that, after about 2.5 hours of running successfully (roughly two loops of the parent eternal function), no longer restarts itself on completion. I checked my storage account for the status of the hanging instance: it is still marked as Running, but there are no matching messages in the task hub storage queues.

Expected behavior

The eternal function should continue executing after it restarts itself via ContinueAsNew.

Actual behavior

Function stops executing.

Relevant source code snippets

        [FunctionName(nameof(FileOrchestrator))]
        public static async Task FileOrchestrator([OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            if (context is null)
            {
                throw new ArgumentNullException(nameof(context));
            }

            (AdlsFeed comb, DispatchInstance dispatch) = context.GetInput<(AdlsFeed comb, DispatchInstance dispatch)>();
            (Source source, Feed feed) = (comb.Source, comb.Feed);

            if (dispatch is null)
            {
                throw new ArgumentException("Dispatch object cannot be null.");
            }

            // Pull entries until new hourly file is generated
            bool fileComplete = AdlsStreamReaderHelper.GetDateFileString(dispatch.FileTime.Value) !=
                AdlsStreamReaderHelper.GetDateFileString(context.CurrentUtcDateTime);

            // Start next run timer before starting run
            DateTime nextrun = context.CurrentUtcDateTime.AddSeconds(FHConfig.WebJobPollWaitSecs);
            (StringBuilder lineBuffer, long filePosition) =
                await context.CallActivityAsync<(StringBuilder lineBuffer, long filePosition)>(nameof(PullNewEntries), (source, feed, dispatch))
                .ConfigureAwait(true);

            // Update instance state on stateful entity and local copy
            EntityId entityId = new EntityId(nameof(DispatchInstance), DispatchInstance.GenerateEntityKey(source, feed));
            context.SignalEntity(entityId, nameof(DispatchInstance.SetFragmentBuffer), lineBuffer);
            context.SignalEntity(entityId, nameof(DispatchInstance.SetReadLength), filePosition);
            dispatch.FragmentBuffer = lineBuffer;
            dispatch.ReadLength = filePosition;

            if (!fileComplete)
            {
                await context.CreateTimer(nextrun, CancellationToken.None).ConfigureAwait(true);
                context.ContinueAsNew((comb, dispatch));
            }
        }

Known workarounds

I'll probably have to create a function that terminates hanging orchestration instances.
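
Something along these lines is what I have in mind; this is only a rough sketch (the function name, schedule, fixed instance ID, and one-hour threshold are all placeholders), not code I've actually deployed:

// Rough sketch: periodically terminate the singleton FileOrchestrator instance
// if it still shows as Running but hasn't produced any new history for over an hour.
// Assumes the usual Durable Functions 2.x usings (Microsoft.Azure.WebJobs,
// Microsoft.Azure.WebJobs.Extensions.DurableTask, Microsoft.Extensions.Logging).
[FunctionName("TerminateHangingOrchestrations")]
public static async Task TerminateHangingOrchestrations(
    [TimerTrigger("0 30 * * * *")] TimerInfo timer,
    [DurableClient] IDurableOrchestrationClient client,
    ILogger log)
{
    const string instanceId = "file-orchestrator-singleton"; // placeholder singleton id

    DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);

    if (status != null
        && status.RuntimeStatus == OrchestrationRuntimeStatus.Running
        && status.LastUpdatedTime < DateTime.UtcNow.AddHours(-1))
    {
        log.LogWarning("Terminating hanging orchestration {InstanceId}", instanceId);
        await client.TerminateAsync(instanceId, "Eternal orchestration stopped restarting");
    }
}

Terminating and then letting the minutely client restart the instance would at least keep things moving while the root cause is investigated.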


ghost commented 4 years ago

I've found a possible suspect here: I have an orchestration client that runs every minute and starts the FileOrchestrator orchestrator as a singleton process. This orchestrator generally restarts itself (ContinueAsNew) at the top of the hour, which closely matches the time of my issue observations. At that moment the client seems to think the process has terminated, slips into the execution queue, and starts a new instance of the same orchestrator. When the original singleton instance then restarts, my guess is that the two instances run in tandem, perhaps sharing the same task hub queue items and causing some sort of conflict that destroys the necessary scheduling messages.

The workaround I'm implementing is to simply not run the client at the top of the hour:

[TimerTrigger("0 */1 * * * *", RunOnStartup = true)]TimerInfo timer //Old
[TimerTrigger("0 1-59 * * * *", RunOnStartup = true)]TimerInfo timer //New

That being said, if my hypothesis is correct, the Durable Functions SDK should either (a) throw an exception when a client attempts to launch a new instance of an orchestration between two eternal executions, or (b) throw an exception in the orchestration instance after running ContinueAsNew if another instance with the same name is already executing.

I'll report back after letting this run for a while to see if this does the trick!

ghost commented 4 years ago

Yep, the timer change appeared to mitigate the issue - my functions have been running for a while without failing to restart. Hope this helps with root cause diagnosis!

ghost commented 4 years ago

Actually, scratch my last - the issue returned.

However, I think this was caused by another change I made - upon investigating my eternal function run history, I found the following:

Here's what my history looks like for a function that is running correctly: [image]

And here's what it looks like for a function that is not restarting: [image]

When looking at the relevant snippet of my orchestration code, I believe I found the problem:

List<Task> setTasks = new List<Task>
{
    context.CallEntityAsync(entityId, nameof(DispatchInstance.SetFragmentBuffer), lineBuffer),
    context.CallEntityAsync(entityId, nameof(DispatchInstance.SetReadLength), filePosition),
};
await Task.WhenAll(setTasks).ConfigureAwait(true);

The .ConfigureAwait(true) settings were added to ensure the orchestration continues running on a single thread. However, given that multiple entities are being called, there appears to be a sporadic race condition where returning to the same thread produces a deadlock!

This pattern does not seem to deadlock when calling activities, but I suppose other single-threaded contexts such as entities or sub-orchestrations could face this issue?

In the meantime, I've mitigated the issue by calling the two entity updates in sequence.
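
Concretely, the mitigation just awaits the two entity calls one after the other instead of combining them with Task.WhenAll (the explicit ConfigureAwait(true) is also redundant here, since a plain await already continues on the captured context):

// Sequential mitigation: update the entity properties one at a time.
await context.CallEntityAsync(entityId, nameof(DispatchInstance.SetFragmentBuffer), lineBuffer);
await context.CallEntityAsync(entityId, nameof(DispatchInstance.SetReadLength), filePosition);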

ghost commented 4 years ago

Following up on this: I've started encountering a similar problem in some other eternal orchestrations. Each time, the issue appears to be specific to CallEntityAsync: the durable entity is updated successfully, but the callback to the parent orchestration function never runs.

Could there perhaps be an issue with the task hub queueing mechanism in this particular instance?

ghost commented 4 years ago

After following this up with Azure Support, my issue appears to be related to this one: https://github.com/Azure/azure-functions-durable-extension/issues/1094

I will try upgrading to v2.1 when it is released to see if this resolves my issue.