RobTF opened this issue 4 years ago
Thanks Rob. I’ll take a second look at this.
Hi @RobTF, I'd also like to look at some of our internal telemetry to see if I can better understand what state your app was in when this happened. Could you share the following info with me?
That should be enough for me to find everything I need. If you're not comfortable sharing the storage account name, but you can at least give me a subset of it, I can probably still find it.
Hi Chris,
Many thanks for looking into this - here's the info:
| Item | Value |
|---|---|
| Region | West Europe |
| Task Hub | hierarchytasks |
| Storage Account Name | Ends with `taskstr` (same resource group as the function app) |
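
For reference, the task hub name above is what we set in the `durableTask` section of host.json. This is just a minimal sketch of that section rather than our exact file, assuming the Functions v1-style layout where `durableTask` sits at the root of host.json (on the v2 runtime it goes under `extensions` instead):

```json
{
  "durableTask": {
    "hubName": "hierarchytasks"
  }
}
```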
regards,
Rob
Hi @RobTF,
I've recently started to encounter a similar kind of problem. However, in our case the CPU peaks and then drops back to normal. The common symptom I've seen is that when this happens, all the `TaskScheduledEvent`s end up in the worker's Dead Letter Queue with the `MaxDeliveryCountExceeded` error. Any chance you had the same symptom?
I've been trying for a couple of days to understand what's going on and can't seem to find a logical explanation. I'll try collecting a dump as well if I can get the timing right.
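
In case it helps anyone reproduce, this is roughly how I've been checking the dead-letter reason. It's only a sketch, assuming the messages are sitting on an Azure Service Bus queue (`MaxDeliveryCountExceeded` is a Service Bus reason code) and using the Azure.Messaging.ServiceBus SDK; the queue name and connection-string variable are placeholders:

```csharp
// Sketch: non-destructively peek the dead-letter sub-queue and print the
// dead-letter reason/description for each message. The queue name and the
// SB_CONNECTION environment variable are placeholders, not from this thread.
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class PeekDeadLetteredMessages
{
    static async Task Main()
    {
        string connectionString = Environment.GetEnvironmentVariable("SB_CONNECTION")!;
        string queueName = "workeritems"; // placeholder

        await using var client = new ServiceBusClient(connectionString);
        await using var receiver = client.CreateReceiver(
            queueName,
            new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter });

        // PeekMessagesAsync does not remove or lock the messages.
        foreach (var msg in await receiver.PeekMessagesAsync(maxMessages: 20))
        {
            Console.WriteLine(
                $"{msg.MessageId}: {msg.DeadLetterReason} - {msg.DeadLetterErrorDescription}");
        }
    }
}
```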
Hi,
Back in March I reported an issue whereby, after some time, the CPU usage of the process running a durable function would jump to 100%, drown the host, and never recover. This happened regardless of the amount of useful work the function app was actually doing.
The original issue was #271. I have opened a new issue in case the underlying cause is different. That issue was closed back in April after a fix, and we had not seen the problem recur until 7th Jan 2020 at around 9:44 PM UTC. We had made no code changes to the affected service. Since then we continually see the problem recur with the affected function app, much as we did prior to the April fix.
We saw this CPU utilization on the App Service plan:
Restarting the affected function app clears the issue for a while, but it always comes back at some point, anywhere from 5 minutes to a day or so later. Under normal operation the CPU usage is around 5-10%.
Even in this state, the Kudu console worked just well enough for me to capture a profile dump from the process while it was faulty (attached below).
Image showing part of the dump in Visual Studio:
Full dump file diagsession.zip
This dump appears to show that the problem stems from the `AzureStorageOrchestrationService.GetNextSessionAsync` method, which is eating 91% CPU and continually calling `AbandonMessagesAsync`. We are using version 1.8 of the Durable Functions library on an Azure App Service plan.
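
To make the shape of the failure concrete: if the session-polling loop keeps abandoning a message and immediately fetching it again with no delay between attempts, it will pin a core even though no orchestration work is being done. The snippet below is only an illustration of that pattern, not the actual DurableTask.AzureStorage code; all of the helper names are made up:

```csharp
// Illustration only: a dequeue loop whose abandon/retry path has no backoff.
// All names here are hypothetical; this is not the library's implementation.
using System;
using System.Threading;
using System.Threading.Tasks;

class NoBackoffLoopSketch
{
    // Stand-ins for the real control-queue operations.
    static Task<object?> FetchMessageAsync() => Task.FromResult<object?>(new object());
    static bool TryOpenSession(object message) => false; // simulate the permanently "stuck" state
    static Task AbandonMessageAsync(object message) => Task.CompletedTask;

    static async Task Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
        long iterations = 0;

        while (!cts.IsCancellationRequested)
        {
            var message = await FetchMessageAsync();
            if (message is not null && !TryOpenSession(message))
            {
                await AbandonMessageAsync(message); // message becomes visible again immediately
                iterations++;
                continue; // no Task.Delay here => the loop spins at ~100% of a core
            }
        }

        Console.WriteLine($"Spun through {iterations:N0} abandon/retry iterations in 2 seconds.");
    }
}
```

If the real code has a backoff that is being skipped or reset somewhere, that would be consistent with the CPU never recovering until the app is restarted.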
Any help with this would be much appreciated as I'd rather not have to keep babysitting the service. I'm happy to provide any further information needed.
regards,
Rob