Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Instances are getting stuck Running even though execution has been completed. ExecutionCompleted is no longer happening. #2133

Closed: philwhere closed this issue 2 years ago

philwhere commented 2 years ago

Resolved

I can now confirm that the issue was caused by adding the "Dynatrace OneAgent" site extension to our function app in Azure. How exactly it breaks is unclear, but the extension seems to do something around threading that results in illegal awaits in the durable task framework. See more here: https://github.com/Azure/azure-functions-durable-extension/issues/2133#issuecomment-1343509726

Other APM site extensions may also cause issues. Apparently .NET Datadog APM produces the same results. https://github.com/Azure/azure-functions-durable-extension/issues/2133#issuecomment-1598787850

Original post below

Description

My Durable Functions app has been running fine in Azure for a while but has suddenly started to misbehave. Instances are completing the entire execution of the orchestrator function, but their instance status is never finalized. The statuses stay as "Running" and the "ExecutionCompleted" event is not added to the instance history table.

Expected behavior

At the end of an orchestration function, the instance should complete, fail or restart (ContinueAsNew).

Actual behavior

The orchestrator runs all the way through logically, but the instance does not end correctly. I have found only a single instance that has been marked as Completed since the issue appeared.

Relevant source code snippets

(Chopped up and simplified example to make sense of the screenshots)

        [FunctionName("Orchestrator")]
        public async Task RunOrchestrator(
            [OrchestrationTrigger] IDurableOrchestrationContext context,
            ILogger log)
        {
            var data = context.GetInput<OrchestratorInputModel>();
            try
            {
                if (data.Something)
                {
                    await context.CallActivityAsync("Calculate", data);
                    context.ContinueAsNew(data);
                    return;
                }
                await context.CallActivityAsync("Publish", data);
            }
            catch (SomeException exception)
            {
                await context.CallActivityAsync("UnrecoverableErrorHandler", exception);
            }
        }

Known workarounds

Redeploying to point to a new task hub worked for a few weeks before the problem returned.

App Details

Screenshots

[Screenshot: history table columns]

[Screenshot: a correctly ending instance]

[Screenshots: instances that have run all the way through but are still Running and missing the ExecutionCompleted row]


davidmrdavid commented 2 years ago

Hi @philwhere,

Thanks for reaching out. I've created an internal thread where we're investigating your incident, I'll get back to you as soon as we make progress.

In the meantime, a preliminary look at our telemetry does suggest that your orchestration is somehow getting stuck when receiving its last Activity result. For some reason, your orchestration's last invocation logs a FunctionAwaited message, implying that it still has work to wait for, even though by then it should have received all results. I don't have a good understanding of what's going on just yet, but I'll keep you updated.

kirk-marple commented 2 years ago

Seeing something rather similar. Just filed #2135

kirk-marple commented 2 years ago

@philwhere @davidmrdavid Just confirmed this is an issue with 2.6.1; rolling back to 2.6.0 made my similar issue go away.

davidmrdavid commented 2 years ago

@philwhere - can you please confirm whether your issue is mitigated when rolling back to 2.6.0? That would help us establish a pattern. Thanks!

philwhere commented 2 years ago

@davidmrdavid the app is currently on 2.6.0

davidmrdavid commented 2 years ago

@philwhere - my apologies, got confused across threads. Thanks, still investigating this issue.

davidmrdavid commented 2 years ago

Hi @philwhere. I noticed your application went through an automated "Functions Host" upgrade from version 3.4.2 to 3.5.2 on ~4/4, which, according to your original post, is around when this issue started. So we have a correlation here that's worth investigating.

Could you please try reverting back to Functions Host version 3.4.2 using the instructions outlined here: https://docs.microsoft.com/en-us/azure/azure-functions/set-runtime-version?tabs=portal#automatic-and-manual-version-updates . Then, please let us know if the issue reproduces. If it does, then we'll know it's not related to that automatic component upgrade. Thanks!

davidmrdavid commented 2 years ago

Another follow-up, @philwhere. Can you please confirm that there were no code changes around the time this issue started? In general, the behavior we're seeing for your app can sometimes be explained by an "illegal await", which is when your orchestration code awaits a non-durable API. Can you please confirm for us that this is not the case? Thanks!
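
For concreteness, here is a minimal sketch of what an illegal await can look like in an in-process orchestrator; the HTTP call and the "CheckHealth" activity are illustrative only and are not taken from this app:

    using System;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public static class IllegalAwaitExample
    {
        private static readonly HttpClient HttpClient = new HttpClient();

        [FunctionName("IllegalAwaitExample")]
        public static async Task Run(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            // Illegal: these tasks are not created by the durable context, so the
            // orchestrator can resume on a non-orchestrator thread and break replay.
            await Task.Delay(TimeSpan.FromSeconds(30));
            await HttpClient.GetStringAsync("https://example.com/health");

            // Durable equivalents: only await APIs exposed by IDurableOrchestrationContext.
            await context.CreateTimer(
                context.CurrentUtcDateTime.AddSeconds(30), CancellationToken.None);
            await context.CallActivityAsync<string>("CheckHealth", null);
        }
    }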

philwhere commented 2 years ago

Yes, there have been no code changes. The last deployment finished 2022-03-14T20:56:17Z. I'll let my team know about the Functions Host action point and I'll let you know if I see any changes 🀞 Thanks @davidmrdavid

philwhere commented 2 years ago

@davidmrdavid Unfortunately (or fortunately), I did not see any changes in behaviour after the change to 3.4.2. I checked both in-flight instances and new instances created after the change.

davidmrdavid commented 2 years ago

Thanks @philwhere.

It seems you're able to reproduce this somewhat reliably. Are you able to put together a small(er) repro that I can use to help dig further into this? I've been trying to repro this in several ways and I've been unsuccessful.

kirk-marple commented 2 years ago

@davidmrdavid In case this is helpful, I'd been seeing similar behavior and had posted a very small repro case in this issue. In this case, with 2.6.1, I never got the "Sent ping" log message.

davidmrdavid commented 2 years ago

Thanks @kirk-marple. I've actually been using your small repro to try to debug this :-). I'll reply shortly in our thread with a .zip of my test project. I'll need a few minutes.

davidmrdavid commented 2 years ago

Hi @philwhere.

I just made a comment summarizing our investigation of your incident, and that of kirk-marple, in this thread: https://github.com/Azure/azure-functions-durable-extension/issues/2135#issuecomment-1118211435

Please give that a quick look for an update. Normally, I'd re-write it all here but it is quite wordy and I'm handling both of your investigations as one work item given their similarities in telemetry.

After checking out that update, I wanted to know if there's anything new on your end. Has this issue occurred again? Do you have a repro I can test out?

In general, a repro would be the fastest path for a root cause analysis here. I know those aren't easy to produce, but so far the DF team hasn't found a way to repro this from the samples and descriptions given, so any pointers from affected users are very valuable. Thank you again!

philwhere commented 2 years ago

Hi @davidmrdavid, apologies for the delay; I've been on leave for a while.

Thanks for trying. Honestly, I think I'd probably have a hard time reproducing it with a new app as well. Last time we deployed it, it was working fine up until a certain point, after which everything started getting stuck. I only know that our app, in its current state, is having this issue.

I can ask my team if it's okay to send you the source code privately. If that would help, let me know how to get it to you. Cheers

davidmrdavid commented 2 years ago

Thanks @philwhere. You can always reach me privately at dajusto@microsoft.com in case your team decides it's ok to share the repro. In the meantime, I'll continue with the next steps I outlined in my linked update above ^.

davidmrdavid commented 2 years ago

Hi @philwhere. I have received your email but haven't had a chance to look at it too deeply. I'll reply there as soon as I do.

In the meantime, I wanted to share with you the update blurb that I've been sharing with users that seem affected by this behavior. Please see below:

""" Just following up here with an update.

The latest DF Extension release (v2.7.1) contains https://github.com/Azure/azure-functions-durable-extension/pull/2168, which adds a check for the suspected root cause of this behavior.

If this check triggers, we will throw an exception on your orchestrator code, which should come up in our internal telemetry. I believe it would also appear in your own log traces within Application Insights, but I don't know that for certain. Regardless, if the issue were to re-occur with this DF Extension version, this new telemetry would give us confirmation of whether this "illegal await" problem is to blame.

To be clear: this new telemetry won't tell us where the illegal await is taking place, it will just confirm that it is occurring.

With that confirmation, we should be able to invest more resources into finding a root cause. Please let us know if you're able to repro this after upgrading to DF 2.7.1, if you decide to do so. Thanks! """

ghost commented 2 years ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

philwhere commented 2 years ago

@davidmrdavid Has anyone on your team had a chance to look at the code at all? The team that owns this project is happy to update the package on the next release, but they're having some issues sourcing a developer at the moment. I'm not really on the team; I'm just helping out on the side.

davidmrdavid commented 2 years ago

Hi @philwhere - apologies for the late response, I thought I had replied already.

I did take a look at the code and did not find any immediate issues with it. To help us make progress in this investigation, it would be most helpful if the team could repro this with DF Extension 2.7.2, so we can see whether our new telemetry flags this issue as being caused by an "illegal await".

ghost commented 2 years ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

philwhere commented 1 year ago

Hi @davidmrdavid,

Since releasing the update to 2.7.2 on 18 July, we haven't had any issues until recently. We're now seeing the exact same behaviour as before.

The earliest instance that I can find with the issue has LastUpdatedTime: 2022-10-13T10:03:43.9087807Z and the instance ID is 5a2767c765c34978a1dbac243cff22ad

Here are a couple more just in case it's useful: 450de3405da649ecb1b934fdec3f41f7 fbecd80774e045a994c7fcd46193f05f 5a3ae17c94f34435b3e157835d6b6142 0c5b45b250134bbaacba091c9e8b5a2e

Now that the telemetry is in place, can you please have a look to see if there are any illegal awaits? I'm about to go on leave for the rest of November so I'll probably just get the team to re-deploy again to hopefully keep the lights on for a while.

Thanks

davidmrdavid commented 1 year ago

Hi @philwhere, thanks for the report.

Could you please provide me with timestamp ranges for those new instanceIDs? Otherwise it's a rather broad query we have to make.

Also - is there any chance you're modifying the default Newtonsoft.JSON serializer in your app, or any global Newtonsoft.JSON config changes? With kirk (also in this thread), we determined that modifying Newtonsoft.JSON's default serializer can lead to this kind of error in some cases. Thanks!

philwhere commented 1 year ago

Hi @davidmrdavid thanks for the quick response!

Serialization

I actually did find some custom serialization in our app, although it's just following what's in the docs here. We have this in Startup.cs (IErrorSerializerSettingsFactory):

    builder.Services.AddSingleton<IErrorSerializerSettingsFactory, CustomErrorSerializer>();

I sent the source code through to your email a while back (9th May) if you want a deeper look.

I'll try to get the team to upgrade to 2.8.1 anyway, since your changes are in. I won't be able to verify in the short term whether this fixes it, because the issue happens so randomly, but it looks promising! πŸ˜ƒ

Instances

Here is some more info on those instances. Note that I have terminated the instances as part of rectifying the customer bookings. I hope that doesn't impact any investigation.

InstanceId CreatedTime LastUpdatedTime
5a2767c765c34978a1dbac243cff22ad 2022-10-12T10:04:17.2487505Z 2022-11-07T08:35:54.0363595Z
450de3405da649ecb1b934fdec3f41f7 2022-10-12T10:04:15.2888472Z 2022-11-07T08:35:23.9958381Z
fbecd80774e045a994c7fcd46193f05f 2022-10-12T10:04:13.5995929Z 2022-11-07T08:03:10.0254040Z
5a3ae17c94f34435b3e157835d6b6142 2022-10-12T10:04:05.9688216Z 2022-11-07T08:07:01.3218174Z
0c5b45b250134bbaacba091c9e8b5a2e 2022-10-12T10:04:21.6488629Z 2022-11-07T09:09:40.7552665Z
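
As an aside, here is a hedged sketch of how instances stuck in the Running state within a creation-time window could be enumerated programmatically with the Durable Functions management API; the helper name and page size are illustrative:

    using System;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public static class StuckInstanceQuery
    {
        // Lists instances still in the Running state that were created inside the
        // given UTC window. The client comes from a [DurableClient] binding elsewhere.
        public static async Task<List<DurableOrchestrationStatus>> ListStuckAsync(
            IDurableOrchestrationClient client, DateTime createdFromUtc, DateTime createdToUtc)
        {
            var stuck = new List<DurableOrchestrationStatus>();
            var condition = new OrchestrationStatusQueryCondition
            {
                RuntimeStatus = new[] { OrchestrationRuntimeStatus.Running },
                CreatedTimeFrom = createdFromUtc,
                CreatedTimeTo = createdToUtc,
                PageSize = 100
            };

            do
            {
                // ListInstancesAsync returns one page at a time; follow the continuation token.
                var page = await client.ListInstancesAsync(condition, CancellationToken.None);
                stuck.AddRange(page.DurableOrchestrationState);
                condition.ContinuationToken = page.ContinuationToken;
            }
            while (!string.IsNullOrEmpty(condition.ContinuationToken));

            return stuck;
        }
    }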

Questions:

  1. The serialization is not quite what's in https://github.com/Azure/azure-functions-durable-extension/issues/2135. Do you think the Startup code I posted would still be an issue?

  2. Did the telemetry from 2.7.1 give us any insight on illegal awaits?

Big apologies if the replies are a bit (or a lot) delayed. I'm going to be on leave starting tomorrow and returning in December (going to California!)

davidmrdavid commented 1 year ago

Thanks @philwhere, and please do enjoy California :) . I'm going there soon myself and look forward to the good weather and great food!

To answer your questions: (1) If you're using builder.Services.AddSingleton to add your custom serialization config, I think you should be safe. My understanding is that the serialization-induced problem is caused primarily by directly manipulating the global JsonConvert.DefaultSettings. Still, I'll keep this possibility in the back of my mind. I agree that updating to the latest Extension is a good idea.
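
To illustrate the distinction, here is a minimal sketch assuming the factory interface from the Durable Functions custom serialization docs; the settings bodies are placeholders, not the app's actual configuration:

    using Microsoft.Azure.WebJobs.Extensions.DurableTask;
    using Newtonsoft.Json;

    // Supported approach: provide error-serialization settings through the extension's
    // factory interface and register it with DI in Startup:
    //   builder.Services.AddSingleton<IErrorSerializerSettingsFactory, CustomErrorSerializer>();
    public class CustomErrorSerializer : IErrorSerializerSettingsFactory
    {
        public JsonSerializerSettings CreateJsonSerializerSettings()
        {
            // Placeholder settings; the real app would put its own choices here.
            return new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.Objects };
        }
    }

    // Risky approach: mutating Newtonsoft.Json's process-wide defaults also changes how
    // the Durable Task Framework serializes its own messages, which has been linked to
    // this class of failure in some cases.
    //   JsonConvert.DefaultSettings = () => new JsonSerializerSettings { /* ... */ };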

(2) Actually, yes! When looking at your logs, do you see an exception starting with

"System.InvalidOperationException: Multithreaded execution was detected. This can happen if the orchestrator function code awaits on a task that was not created by a DurableOrchestrationContext method. More details can be found in this article https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-checkpointing-and-replay#orchestrator-code-constraints. at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableOrchestrationContext.ThrowIfInvalidAccess() [...]" ?

I see this for your instanceIDs on our internal telemetry, and I believe this should be surfacing on your end as well. Please let me know if it isn't! In any case, this makes me think that the problem is not related to serialization, as the error clearly indicates the problem is with multi-threading.

In short: please let me know if you're able to see these errors on your end as well. Also please let me know if these orchestrators are stuck in the "Running" state, or if they're not showing up as "Failed" on your end. Thanks!

philwhere commented 1 year ago

Hi @davidmrdavid

I'm just writing to give an update on our findings.

So first to answer your questions: Yes, we saw the errors on our side and yes, they were showing as Running.

Major news

I'm about 95% confident that we've found the actual culprit! We've been able to trace the start of the issue to when we added the "Dynatrace OneAgent" extension to our application.

Our deployments go to a new slot (without the extension) and then a slot swap is done, which explains why new deployments fixed the issue for us in the past.


We're just not going to install it on this particular app going forward, and hopefully we'll be set. Thanks so much for your help with this, and hopefully it can help others.

davidmrdavid commented 1 year ago

Hi @philwhere,

great news! Thanks for sharing.

Super interesting bit about the OneAgent correlation: this is the first time I've heard of this extension, but from what I gather it essentially instruments your app to gather performance data, is that correct? If so, it's possible that this instrumentation affects thread ordering, which in turn could cause non-determinism. Interesting, though still just a theory :)

If you have any more details about OneAgent that you think may help explain this error, that would be really helpful. Thanks!

davidmrdavid commented 1 year ago

@philwhere: I'm interested in trying to reproduce this issue to see if Dynatrace OneAgent can cause non-determinism in simple orchestrators. Do you have any guidance / pointers on how to minimally set this up in a way similar to how Dynatrace interfaced with your DF app?

philwhere commented 1 year ago

@davidmrdavid I also think it's probably doing something that affects the threading, based on the fact that we start to see those "Multithreaded execution" logs.

If you do want to have a try on your end, you can follow this guide here. Just note that you'll need an actual Dynatrace environment to connect the extension to. There looks to be a free trial available, but I've never set one up myself, so all I can say is good luck. πŸ˜ƒ

savbace commented 1 year ago

@philwhere I ran into a similar issue while running a very simple orchestration with a single activity. The ExecutionCompleted event is also missing in my case and the instance is stuck in the Running state. Additionally, I get the following error in the logs:

Error: The orchestrator function completed on a non-orchestrator thread!
Exception: System.InvalidOperationException: An invalid asynchronous invocation was detected. This can be caused by awaiting non-durable tasks in an orchestrator function's implementation or by middleware that invokes asynchronous code.
   at Microsoft.Azure.Functions.Worker.Extensions.DurableTask.FunctionsOrchestrationContext.ThrowIfIllegalAccess() in /_/src/Worker.Extensions.DurableTask/FunctionsOrchestrationContext.cs:line 133
Category: Microsoft.Azure.Functions.Worker.Extensions.DurableTask.DurableTaskFunctionsMiddleware
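
The check above lives in the .NET isolated worker's Durable Task middleware. For comparison with the in-process sketch earlier in the thread, here is a minimal isolated-model orchestrator that stays within the constraint; the "SendPing" activity name is illustrative:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.Functions.Worker;
    using Microsoft.DurableTask;

    public static class PingOrchestrator
    {
        [Function(nameof(PingOrchestrator))]
        public static async Task Run([OrchestrationTrigger] TaskOrchestrationContext context)
        {
            // Only durable APIs may be awaited here. Awaiting anything else, or having
            // middleware (such as an APM profiler) move the invocation onto another thread,
            // trips ThrowIfIllegalAccess and produces the error shown above.
            await context.CallActivityAsync("SendPing");
            await context.CreateTimer(TimeSpan.FromMinutes(1), CancellationToken.None);
        }
    }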

Configuration

Backend: Azure Storage
Runtime: ~4
Stack: .NET 7 Isolated

Packages

Extensions: .NET Datadog APM 2.32.0

The root cause also seems to be related to an extension (Datadog in my case). At least, when I remove the extension I see no errors and the orchestrator reaches the Completed state successfully. Just wanted to share this weird behavior πŸ™‚.