Hi @philwhere,
Thanks for reaching out. I've created an internal thread where we're investigating your incident; I'll get back to you as soon as we make progress.
In the meantime, a preliminary look at our telemetry does suggest that your orchestration is somehow getting stuck when receiving its last Activity result. For some reason, your orchestration's last invocation logs a FunctionAwaited message, implying that it still has work to wait for, even though by then it should have received all results. I don't have a good understanding of what's going on just yet, but I'll keep you updated.
Seeing something rather similar. Just filed #2135
@philwhere @davidmrdavid Just confirmed this is an issue with 2.6.1; rolling back to 2.6.0 made my similar issue go away.
@philwhere - can you please confirm whether your issue is mitigated when rolling back to 2.6.0? That would help us establish a pattern. Thanks!
@davidmrdavid the app is currently on 2.6.0
@philwhere - my apologies, got confused across threads. Thanks, still investigating this issue.
Hi @philwhere. I noticed your application went through an automated "Functions Host" upgrade from version 3.4.2 to 3.5.2 on ~4/4, which, according to your original post, is around when this issue started. So we have a correlation here that's worth investigating.
Could you please try reverting back to Functions Host version 3.4.2 using the instructions outlined here: https://docs.microsoft.com/en-us/azure/azure-functions/set-runtime-version?tabs=portal#automatic-and-manual-version-updates . Then, please let us know if the issue reproduces. If it does, then we'll know it's not related to that automatic component upgrade. Thanks!
Another follow-up, @philwhere.
Can you please confirm that there were no code changes around the time this issue started? In general, the behavior we're seeing for your app can sometimes be explained by an "illegal await", which is when your orchestrator code awaits a non-durable API. Can you please confirm for us that this is not the case? Thanks!
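For illustration, here's a minimal sketch (with made-up function and activity names, not your code) of the distinction between durable and non-durable awaits in an in-process orchestrator:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class AwaitExamples
{
    [FunctionName("AwaitExamples")]
    public static async Task Run([OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Legal: awaiting durable APIs exposed by the orchestration context.
        await context.CallActivityAsync("SomeActivity", null);
        await context.CreateTimer(context.CurrentUtcDateTime.AddMinutes(1), CancellationToken.None);

        // Illegal: awaiting non-durable tasks from orchestrator code. These can
        // resume on a different thread and break the replay model, e.g.:
        // await Task.Delay(TimeSpan.FromSeconds(5));
        // await new HttpClient().GetStringAsync("https://example.com");
    }
}
```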
Yes, there have been no code changes. The last deployment finished 2022-03-14T20:56:17Z.
I'll let my team know about the Functions Host action point and I'll let you know if I see any changes.
Thanks @davidmrdavid
@davidmrdavid Unfortunately (or fortunately), I did not see any changes in behaviour after the change to 3.4.2. I checked both in-flight instances and new instances created after the change.
Thanks @philwhere.
It seems you're able to reproduce this somewhat reliably. Are you able to put together a small(er) repro that I can use to help dig further into this? I've been trying to repro this in several ways and I've been unsuccessful.
@davidmrdavid In case this is helpful, I'd been seeing similar behavior and had posted a very small repro case in this issue. In this case, with 2.6.1, I never got the "Sent ping" log message.
Thanks @kirk-marple. I've actually been using your small repro to try to debug this :-). I'll reply shortly in our thread with a .zip of my test project. I'll need a few minutes.
Hi @philwhere.
I just made a comment summarizing our investigation of your incident, and that of kirk-marple, in this thread: https://github.com/Azure/azure-functions-durable-extension/issues/2135#issuecomment-1118211435
Please give that a quick look for an update. Normally, I'd re-write it all here but it is quite wordy and I'm handling both of your investigations as one work item given their similarities in telemetry.
After checking out that update, I wanted to know if there's anything new on your end. Has this issue occurred again? Do you have a repro I can test out?
In general, a repro would be the fastest path to a root cause analysis here. I know those aren't easy to produce, but so far the DF team hasn't found a way to repro this from the samples and descriptions given, so any pointers from affected users are very valuable. Thank you again!
Hi @davidmrdavid, apologies for the delay; I've been on leave for a while.
Thanks for trying. Honestly, I think I'd probably have a hard time reproducing it with a new app as well. The last time we deployed, everything worked fine up until a certain point, and only after that did instances start getting stuck. All I know is that our app, in its current state, is having this issue.
I can ask my team if it's okay to send you the source code privately. If that would help, let me know how to get it to you. Cheers
Thanks @philwhere. You can always reach me privately at dajusto@microsoft.com in case your team decides it's ok to share the repro. In the meantime, I'll continue with the next steps I outlined in my linked update above ^.
Hi @philwhere. I have received your email but haven't had a chance to look at it too deeply. I'll reply there as soon as I do.
In the meantime, I wanted to share with you the update blurb that I've been sharing with users that seem affected by this behavior. Please see below:
""" Just following up here with an update.
The latest DF Extension release (v2.7.1) contains https://github.com/Azure/azure-functions-durable-extension/pull/2168, which adds a check for what we suspect is the root cause of this behavior.
If this check triggers, we will throw an exception on your orchestrator code, which should come up in our internal telemetry. I believe it would also appear in your own log traces within Application Insights, but I don't know that for certain. Regardless, if the issue were to re-occur with this DF Extension version, this new telemetry would give us confirmation of whether this "illegal await" problem is to blame.
To be clear: this new telemetry won't tell us where the illegal await is taking place, it will just confirm that it is occurring.
With that confirmation, we should be able to invest more resources into finding a root cause. Please let us know if you're able to repro this after upgrading to DF 2.7.1, if you decide to do so. Thanks! """
This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.
@davidmrdavid Has anyone in your team had a chance to look at the code at all? The team that owns this project is happy to update the package on the next release, but they're having some issues sourcing a developer at the moment. I'm not really on the team; I'm just helping out on the side.
Hi @philwhere - apologies for the late response, I thought I had replied already.
I did take a look at the code and did not find any immediate issues with it. To help us make progress in this investigation, it would be most helpful if the team could repro this with DF Extension 2.7.2, so we can see whether our new telemetry flags this issue as being caused by an "illegal await".
This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.
Hi @davidmrdavid,
Since updating to 2.7.2 on 18 July we hadn't had any issues, but recently we've started seeing the exact same behaviour as before.
The earliest instance that I can find with the issue has LastUpdatedTime 2022-10-13T10:03:43.9087807Z and instance ID 5a2767c765c34978a1dbac243cff22ad.
Here are a couple more just in case it's useful: 450de3405da649ecb1b934fdec3f41f7 fbecd80774e045a994c7fcd46193f05f 5a3ae17c94f34435b3e157835d6b6142 0c5b45b250134bbaacba091c9e8b5a2e
Now that the telemetry is in place, can you please have a look to see if there are any illegal awaits? I'm about to go on leave for the rest of November so I'll probably just get the team to re-deploy again to hopefully keep the lights on for a while.
Thanks
Hi @philwhere, thanks for the report.
Could you please provide me with timestamp ranges for those new instanceIDs? Otherwise it's a rather broad query we have to make.
Also - is there any chance you're modifying the default Newtonsoft.JSON serializer in your app, or any global Newtonsoft.JSON config changes? With kirk (also in this thread), we determined that modifying Newtonsoft.JSON's default serializer can lead to this kind of error in some cases. Thanks!
Hi @davidmrdavid thanks for the quick response!
I actually did find some custom serialization in our app, although it's just following what's in the docs here.
We have this in our Startup.cs (an IErrorSerializerSettingsFactory registration):
builder.Services.AddSingleton<IErrorSerializerSettingsFactory, CustomErrorSerializer>();
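For context, the factory just follows the general shape from the docs, something like the sketch below (simplified, not our exact settings):

```csharp
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

// Simplified sketch of the registered factory (not the exact settings we use).
public class CustomErrorSerializer : IErrorSerializerSettingsFactory
{
    public JsonSerializerSettings CreateJsonSerializerSettings()
    {
        // Settings are scoped to Durable Functions via DI; nothing global is touched.
        return new JsonSerializerSettings
        {
            Formatting = Formatting.Indented
        };
    }
}
```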
I sent through the source code a while back to your email if you want a deeper look (9th May).
I'll try to get the team to upgrade to 2.8.1 anyway, since your changes are in. I won't be able to verify in the short term whether this fixes it, because the issue happens so randomly, but it looks promising!
Here is some more info on those instances. Note that I have terminated the instances as part of rectifying the customer bookings. I hope that doesn't impact any investigation.
InstanceId | CreatedTime | LastUpdatedTime |
---|---|---|
5a2767c765c34978a1dbac243cff22ad | 2022-10-12T10:04:17.2487505Z | 2022-11-07T08:35:54.0363595Z |
450de3405da649ecb1b934fdec3f41f7 | 2022-10-12T10:04:15.2888472Z | 2022-11-07T08:35:23.9958381Z |
fbecd80774e045a994c7fcd46193f05f | 2022-10-12T10:04:13.5995929Z | 2022-11-07T08:03:10.0254040Z |
5a3ae17c94f34435b3e157835d6b6142 | 2022-10-12T10:04:05.9688216Z | 2022-11-07T08:07:01.3218174Z |
0c5b45b250134bbaacba091c9e8b5a2e | 2022-10-12T10:04:21.6488629Z | 2022-11-07T09:09:40.7552665Z |
The serialization is not quite what's in https://github.com/Azure/azure-functions-durable-extension/issues/2135. Do you think the Startup code I posted would still be an issue?
Did the telemetry from 2.7.1 give us any insight on illegal awaits?
Big apologies if the replies are a bit (or a lot) delayed. I'm going to be on leave starting tomorrow and returning in December (going to California!)
Thanks @philwhere, and please do enjoy California :) . I'm going there soon myself and look forward to the good weather and great food!
To answer your questions:
(1) If you're utilizing builder.Services.AddSingleton to add your custom serialization config, I think you should be safe. My understanding is that the serialization-induced problem is caused primarily by directly manipulating the global JsonConvert.DefaultSettings. Still, I'll keep this possibility in the back of my mind. I agree that updating to the latest Extension is a good idea.
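To illustrate the distinction, a minimal sketch (not code from your app):

```csharp
using Newtonsoft.Json;

public static class SerializationPatterns
{
    // Pattern that has been linked to this class of problem: mutating the
    // process-wide Newtonsoft defaults, which the Durable Task Framework also
    // picks up for its own internal messages.
    public static void GlobalOverride()
    {
        JsonConvert.DefaultSettings = () => new JsonSerializerSettings
        {
            NullValueHandling = NullValueHandling.Ignore
        };
    }

    // Safer pattern (what you have): scope the settings to Durable Functions
    // through DI and leave JsonConvert.DefaultSettings untouched, e.g.
    // builder.Services.AddSingleton<IErrorSerializerSettingsFactory, CustomErrorSerializer>();
}
```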
(2) Actually, yes! When looking at your logs, do you see an exception starting with
"System.InvalidOperationException: Multithreaded execution was detected. This can happen if the orchestrator function code awaits on a task that was not created by a DurableOrchestrationContext method. More details can be found in this article https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-checkpointing-and-replay#orchestrator-code-constraints. at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableOrchestrationContext.ThrowIfInvalidAccess() [...]" ?
I see this for your instanceIDs on our internal telemetry, and I believe this should be surfacing on your end as well. Please let me know if it isn't! In any case, this makes me think that the problem is not related to serialization, as the error clearly indicates the problem is with multi-threading.
In short: please let me know if you're able to see these errors on your end as well. Also please let me know if these orchestrators are stuck in the "Running" state, or if they're not showing up as "Failed" on your end. Thanks!
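If it helps with checking instance states on your end, a hypothetical status-check function like the one below (made-up name, standard IDurableOrchestrationClient APIs) can query an instance directly:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

// Hypothetical helper (not part of the app): returns the runtime status of a
// given orchestration instance so you can see whether it reads "Running".
public static class CheckInstanceStatus
{
    [FunctionName("CheckInstanceStatus")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "status/{instanceId}")] HttpRequest req,
        string instanceId,
        [DurableClient] IDurableOrchestrationClient client)
    {
        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);
        if (status == null)
        {
            return new NotFoundResult();
        }

        return new OkObjectResult(new { status.InstanceId, status.RuntimeStatus, status.LastUpdatedTime });
    }
}
```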
Hi @davidmrdavid
I'm just writing to give an update on our findings.
So first to answer your questions: Yes, we saw the errors on our side and yes, they were showing as Running.
I'm about 95% confident that we've found the actual culprit! We've been able to trace the start of the issue to when we added the "Dynatrace OneAgent" extension to our application.
Our deployments go to a new slot (without the extension) and then a slot swap is done, which explains why new deployments fixed the issue for us in the past.
We're going to just not install it on this particular app going forward and hopefully we'll be set. Thanks so much for your help with this and hopefully this can help others.
Hi @philwhere,
great news! Thanks for sharing.
Super interesting bit about the OneAgent correlation: this is the first time I've heard of this extension, but from what I gather it essentially instruments your app to gather performance data, is that correct? If so, that instrumentation may affect thread ordering, which in turn could cause non-determinism. Interesting, though still just a theory :)
If you have any more details about OneAgent that you think may help explain this error, that would be really helpful. Thanks!
@philwhere: I'm interested in trying to reproduce this issue to see if Dynatrace OneAgent can cause non-determinism in simple orchestrators. Do you have any guidance / pointers on how to minimally set this up in a way similar to how Dynatrace interfaced with your DF app?
@davidmrdavid I also think it's probably doing something that's affecting the threading based on the fact that we start to see those "Multithreaded execution" logs.
If you do want to have a try on your end, you can follow this guide here. Just note that you'll need to actually have a Dynatrace environment to connect the extension to. It looks like there is a free trial available, but I've never set one up myself, so all I can say is good luck.
@philwhere I bumped into a similar issue while running a very simple orchestration with a single activity. The ExecutionCompleted event is also missing in my case and the instance is stuck in the Running state.
Additionally I get the following error in logs:
Error: The orchestrator function completed on a non-orchestrator thread!
Exception: System.InvalidOperationException: An invalid asynchronous invocation was detected. This can be caused by awaiting non-durable tasks in an orchestrator function's implementation or by middleware that invokes asynchronous code.
at Microsoft.Azure.Functions.Worker.Extensions.DurableTask.FunctionsOrchestrationContext.ThrowIfIllegalAccess() in /_/src/Worker.Extensions.DurableTask/FunctionsOrchestrationContext.cs:line 133
Category: Microsoft.Azure.Functions.Worker.Extensions.DurableTask.DurableTaskFunctionsMiddleware
Backend: Azure Storage
Runtime: ~4
Stack: .NET 7 Isolated
Packages:
Extensions: .NET Datadog APM 2.32.0
The root cause also seems to be related to an extension (Datadog in my case). At least, when I remove the extension I see no errors and the orchestrator reaches the Completed state successfully.
Just wanted to share this weird behavior.
Resolved
I can now confirm that the issue was caused by adding the "Dynatrace OneAgent" site extension to our function app in Azure. How exactly it breaks is unclear, but the extension seems to do something around threading that results in illegal awaits in the durable task framework. See more here: https://github.com/Azure/azure-functions-durable-extension/issues/2133#issuecomment-1343509726
Other APM site extensions may also cause issues. Apparently .NET Datadog APM produces the same results. https://github.com/Azure/azure-functions-durable-extension/issues/2133#issuecomment-1598787850
Original post below
Description
My Durable Functions app has been running fine in Azure for a while but has suddenly started to misbehave. Instances are running their orchestrator function all the way to the end, but their status is no longer being completed: the statuses stay as "Running" and the "ExecutionCompleted" event is not added to the instance history table.
Expected behavior
At the end of an orchestration function, the instance should complete, fail or restart (ContinueAsNew).
Actual behavior
The orchestrator runs all the way through logically, but the instance does not end correctly. I have found only a single instance that has been marked as Completed since the issue appeared.
Relevant source code snippets
(Chopped up and simplified example to make sense of the screenshots)
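(The snippets themselves were shared as screenshots. For readers of this thread, the sketch below is only a hypothetical orchestrator of the general shape described, with made-up names, not the real app code.)

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

// Hypothetical shape only (made-up names): the orchestrator body runs to the
// end, but the instance never records an ExecutionCompleted event and stays
// "Running".
public static class BookingOrchestration
{
    [FunctionName("BookingOrchestration")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context,
        ILogger log)
    {
        var bookingId = context.GetInput<string>();

        await context.CallActivityAsync("ProcessBooking", bookingId);
        await context.CallActivityAsync("SendConfirmation", bookingId);

        // This final log line shows up, so the code path completes...
        if (!context.IsReplaying)
        {
            log.LogInformation("Finished orchestration {InstanceId}", context.InstanceId);
        }
        // ...yet the runtime status remains "Running" instead of "Completed".
    }
}
```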
Known workarounds
Redeploying to point to a new task hub worked for a few weeks before the problem returned.
App Details
Screenshots
Columns
Correctly ending instance.
Instances that have run all the way through but are still running and missing the ExecutionCompleted line.
If deployed to Azure