RobTF opened this issue 4 years ago
Thanks Rob. I’ll take a second look at this.
Hi @RobTF, I'd also like to look at some of our internal telemetry to see if I can better understand what state your app was in when this happened. Could you share the following info with me?
That should be enough for me to find everything I need. If you're not comfortable sharing the storage account name, but you can at least give me a subset of it, I can probably still find it.
Hi Chris,
Many thanks for looking into this - here's the info:
| Item | Value |
|---|---|
| Region | West Europe |
| Task Hub | hierarchytasks |
| Storage Account Name | Ends with `taskstr` (same resource group as the function app) |
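
For reference, the task hub name above is what we set in the `durableTask` section of host.json. This is just a minimal sketch of that section rather than our exact file, assuming the Functions v1-style layout where `durableTask` sits at the root of host.json (on the v2 runtime it goes under `extensions` instead):

```json
{
  "durableTask": {
    "hubName": "hierarchytasks"
  }
}
```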
regards,
Rob
Hi @RobTF,
I've recently started to encounter a similar kind of problem. However, in our case the CPU peaks and then drops back to normal. The common symptom I've seen is that when this happens, all the `TaskScheduledEvent`s end up in the worker's Dead Letter Queue with the `MaxDeliveryCountExceeded` error. Any chance you had the same symptom?
I've been trying for a couple of days to understand what's going on and can't seem to find a logical explanation. I'll try collecting a dump as well if I can get the timing right.
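
In case it helps anyone reproduce, this is roughly how I've been checking the dead-letter reason. It's only a sketch, assuming the messages are sitting on an Azure Service Bus queue (`MaxDeliveryCountExceeded` is a Service Bus reason code) and using the Azure.Messaging.ServiceBus SDK; the queue name and connection-string variable are placeholders:

```csharp
// Sketch: non-destructively peek the dead-letter sub-queue and print the
// dead-letter reason/description for each message. The queue name and the
// SB_CONNECTION environment variable are placeholders, not from this thread.
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class PeekDeadLetteredMessages
{
    static async Task Main()
    {
        string connectionString = Environment.GetEnvironmentVariable("SB_CONNECTION")!;
        string queueName = "workeritems"; // placeholder

        await using var client = new ServiceBusClient(connectionString);
        await using var receiver = client.CreateReceiver(
            queueName,
            new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter });

        // PeekMessagesAsync does not remove or lock the messages.
        foreach (var msg in await receiver.PeekMessagesAsync(maxMessages: 20))
        {
            Console.WriteLine(
                $"{msg.MessageId}: {msg.DeadLetterReason} - {msg.DeadLetterErrorDescription}");
        }
    }
}
```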
Hi,
Back in March I reported an issue whereby, after some time, the CPU usage of the process running a durable function would jump to 100%, drown the host, and never recover. This happened regardless of the amount of useful work the function app was actually doing.
The original issue was #271. I have opened a new issue in case the underlying cause is different. That issue was closed back in April after a fix, and we had not seen the problem recur until 7th Jan 2020 at around 9:44 PM UTC. We had made no code changes to the affected service. Since then we continually see the problem recur with the affected function app, much as we did prior to the April fix.
We saw this CPU utilization on the App Service plan:
Restarting the affected function app clears the issue for a while, but it always comes back at some point, anywhere from 5 minutes to a day or so later. Under normal operation the CPU usage is around 5-10%.
Even in this state, the Kudu console worked just well enough for me to capture a profile dump from the process while it was faulty (attached below).
Image showing part of the dump in Visual Studio:
Full dump file diagsession.zip
This dump appears to show that the problem stems from the `AzureStorageOrchestrationService.GetNextSessionAsync` method, which is eating 91% CPU and continually calling `AbandonMessagesAsync`. We are using version 1.8 of the Durable Functions library on an Azure App Service plan.
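
To make the shape of the failure concrete: if the session-polling loop keeps abandoning a message and immediately fetching it again with no delay between attempts, it will pin a core even though no orchestration work is being done. The snippet below is only an illustration of that pattern, not the actual DurableTask.AzureStorage code; all of the helper names are made up:

```csharp
// Illustration only: a dequeue loop whose abandon/retry path has no backoff.
// All names here are hypothetical; this is not the library's implementation.
using System;
using System.Threading;
using System.Threading.Tasks;

class NoBackoffLoopSketch
{
    // Stand-ins for the real control-queue operations.
    static Task<object?> FetchMessageAsync() => Task.FromResult<object?>(new object());
    static bool TryOpenSession(object message) => false; // simulate the permanently "stuck" state
    static Task AbandonMessageAsync(object message) => Task.CompletedTask;

    static async Task Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
        long iterations = 0;

        while (!cts.IsCancellationRequested)
        {
            var message = await FetchMessageAsync();
            if (message is not null && !TryOpenSession(message))
            {
                await AbandonMessageAsync(message); // message becomes visible again immediately
                iterations++;
                continue; // no Task.Delay here => the loop spins at ~100% of a core
            }
        }

        Console.WriteLine($"Spun through {iterations:N0} abandon/retry iterations in 2 seconds.");
    }
}
```

If the real code has a backoff that is being skipped or reset somewhere, that would be consistent with the CPU never recovering until the app is restarted.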
Any help with this would be much appreciated as I'd rather not have to keep babysitting the service. I'm happy to provide any further information needed.
regards,
Rob