Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License
714 stars 270 forks

High CPU and 404 Table Errors load testing a Premium v2 Durable Functions Chaining Sample on Dedicated. #666

Open FinVamp1 opened 5 years ago

FinVamp1 commented 5 years ago

Describe the bug A clear and concise description of what the bug is. Please make an effort to fill in all the sections below; the information will help us investigate your issue.

Investigative information

If deployed to Azure

To Reproduce Steps to reproduce the behavior:

  1. Take the Chaining Sample and deploy it to a V2 application (upgraded to 1.8.0).
  2. Generate a VS load test for 25 users and 30 minutes against http://fintestdurablestress.azurewebsites.net/orchestrators/E1_HelloSequence
  3. The CPU will go to 100% and the initial startup will generate 404 errors.
  4. This app is configured with the Durable Task Extension in a separate Storage account from AzureWebJobsStorage.
  5. These are the settings for the Durable Task Extension:

"durableTask": {
  "hubName": "SampleHubVS",
  "ControlQueueBatchSize": 32,
  "PartitionCount": 2,
  "ControlQueueVisibilityTimeout": "00:05:00",
  "WorkItemQueueVisibilityTimeout": "00:05:00",
  "AzureStorageConnectionStringName": "TestDurableStorage",
  "TraceInputsAndOutputs": false,
  "LogReplayEvents": false
}
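For anyone reproducing this without Visual Studio, the load in step 2 can be approximated with a small script. This is only a sketch: `run_load` and `start_orchestration` are hypothetical helper names, and it assumes the HTTP starter accepts an empty POST, as the sample's starter function does. It is not the VS load test used for the original report.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib import error, request

# URL taken from the repro steps above.
URL = "http://fintestdurablestress.azurewebsites.net/orchestrators/E1_HelloSequence"


def start_orchestration(url: str = URL, timeout: float = 30.0) -> int:
    """POST once to the HTTP starter and return the HTTP status code."""
    req = request.Request(url, data=b"", method="POST")  # POST is an assumption
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        return exc.code  # e.g. 404, 429, 500
    except OSError:
        return -1  # transport failure (DNS, refused connection, timeout)


def run_load(send, users: int, duration_s: float) -> Counter:
    """Run `users` workers that call `send()` in a loop until the deadline.

    Returns a Counter mapping each result (status code) to its count.
    """
    deadline = time.monotonic() + duration_s

    def worker() -> Counter:
        seen = Counter()
        while time.monotonic() < deadline:
            seen[send()] += 1
        return seen

    totals = Counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(worker) for _ in range(users)]
        for future in futures:
            totals.update(future.result())
    return totals


# Example (25 users for 30 minutes, matching the repro steps):
#   print(run_load(start_orchestration, users=25, duration_s=30 * 60))
```

The returned counter makes it easy to see whether the app answers with 202s, 429s, or errors while the CPU is pegged.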

While the Orchestrations are under stress we see an increased number of 404 errors from Table Storage.

Time: 3:35:25 PM
Duration: 4 ms
Outgoing Command: GET fintestdurablestorage/SampleHubVSInstances
Result code: 404
Category: Function.HttpStart
LogLevel: Information
InvocationId: 70b05b11-fda7-472a-ac80-e81ec9417778
https://fintestdurablestorage.table.core.windows.net:443/SampleHubVSInstances(PartitionKey='394a269be60f42bc81ce72ff168ddead',RowKey='')?$select=ExecutionId%2CName%2CVersion%2COutput%2CCustomStatus%2CCreatedTime%2CLastUpdatedTime%2CRuntimeStatus%2CPartitionKey%2CRowKey%2CTimestamp%2CETag
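To make the logged request easier to read, the URL can be taken apart with the standard library. The sketch below only decodes what is in the log line: a single-entity lookup against the `SampleHubVSInstances` table by `PartitionKey` with an empty `RowKey`. `describe_point_query` is a hypothetical helper name, and reading the partition key as the orchestration instance ID is an assumption based on its shape, not something the log states.

```python
import re
from urllib.parse import parse_qs, urlsplit

# URL copied from the dependency log entry above.
LOGGED_URL = (
    "https://fintestdurablestorage.table.core.windows.net:443/"
    "SampleHubVSInstances(PartitionKey='394a269be60f42bc81ce72ff168ddead',RowKey='')"
    "?$select=ExecutionId%2CName%2CVersion%2COutput%2CCustomStatus%2CCreatedTime"
    "%2CLastUpdatedTime%2CRuntimeStatus%2CPartitionKey%2CRowKey%2CTimestamp%2CETag"
)

# Matches a Table Storage single-entity path: /Table(PartitionKey='..',RowKey='..')
_ENTITY_PATH = re.compile(r"/(\w+)\(PartitionKey='([^']*)',RowKey='([^']*)'\)")


def describe_point_query(url: str) -> dict:
    """Split a Table Storage entity URL into table, keys, and selected columns."""
    parts = urlsplit(url)
    match = _ENTITY_PATH.fullmatch(parts.path)
    if match is None:
        raise ValueError("not a single-entity table URL")
    table, partition_key, row_key = match.groups()
    # parse_qs percent-decodes the value, so %2C becomes a comma.
    columns = parse_qs(parts.query)["$select"][0].split(",")
    return {
        "table": table,
        "partition_key": partition_key,
        "row_key": row_key,
        "columns": columns,
    }


info = describe_point_query(LOGGED_URL)
print(info["table"], info["partition_key"], len(info["columns"]))
```

A 404 on a lookup of this shape simply means no entity with that key exists in the table at the time of the request.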

While not required, providing your orchestrator's source code in anonymized form is often very helpful when investigating unexpected orchestrator behavior.

Expected behavior

  1. The orchestrations fail to start or reply with 429 throttling errors.

Actual behavior

  1. The single instance tries to handle the load and is slow to do so.

Screenshots If applicable, add screenshots to help explain your problem.

image

Known workarounds Provide a description of any known workarounds you used. I will test again with extendedSessions disabled. https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#orchestrator-function-replay

Additional context

Executing 'E1_HelloSequence' (Reason='', Id=f62fa09e-b783-4fd7-9656-d45e1d7c3f91)
Executing 'E1_HelloSequence' (Reason='', Id=9586aea6-26db-492e-8876-18dcf0a14ead)
Executing 'E1_HelloSequence' (Reason='', Id=d415f395-2516-4999-ba7e-7ee6d9e25920)

cgillum commented 5 years ago

There are several issues described in this bug. If I understand correctly, they can be summarized as follows:

  1. Dependency errors from Azure Storage: The HTTP 404 errors are from Azure Storage and not from the function app, but are handled gracefully. The fact that you see a dependency error is a known issue being tracked here: https://github.com/Azure/azure-functions-durable-extension/issues/593.
  2. Lack of throttling in HTTP triggers: The high CPU and the absence of 429s are behaviors of HTTP-triggered functions in general, and are not specific to Durable Functions. You can find more information on how to configure throttling here: https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook#trigger---hostjson-properties

Did I understand the concerns correctly? Were there any other concerns besides these?
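As a rough illustration of the throttling configuration mentioned in point 2, the sketch below renders the `http` section of a v2 `host.json`. The setting names (`maxOutstandingRequests`, `maxConcurrentRequests`, `dynamicThrottlesEnabled`) come from the linked host.json documentation; the helper name and the numeric values are placeholders chosen for illustration, not recommendations for this app.

```python
import json


def http_throttle_host_json(max_outstanding: int, max_concurrent: int,
                            dynamic_throttles: bool) -> dict:
    """Build a host.json document containing only the HTTP throttle settings."""
    return {
        "version": "2.0",
        "extensions": {
            "http": {
                # Requests beyond this many outstanding (queued plus running)
                # are rejected with a 429.
                "maxOutstandingRequests": max_outstanding,
                # Upper bound on HTTP functions executing in parallel.
                "maxConcurrentRequests": max_concurrent,
                # Lets the host reject requests based on system counters.
                "dynamicThrottlesEnabled": dynamic_throttles,
            }
        },
    }


# Placeholder values; tune them against your own load test results.
print(json.dumps(http_throttle_host_json(200, 100, True), indent=2))
```

With limits like these in place, excess requests should be rejected with 429s rather than all being accepted by the single instance.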

FinVamp1 commented 5 years ago

Thank you Chris. I think these issues arise as a consequence of one larger concern: on Dedicated, how do you determine how many instances you need to handle a given number of parallel orchestration calls, each of which launches sequential activities? If you enable the HTTP dynamic throttling functionality, then we'll return 429 if the counters exceed 80%, which will give you the throughput of a single instance. What do you think?

cgillum commented 5 years ago

The optimal number will depend on the workload itself. For example, if we're just talking about sequences and the activity functions are expected to be heavy in CPU usage, then you would probably want a number of VMs (and partitions) to equal the number of concurrent orchestrations you need to support.

For other workloads, I expect some amount of trial and error would be required. I agree we probably need some better guidance here though.

FinVamp1 commented 5 years ago

I think HTTP throttles will not work on Dedicated, as we don't track the performance counters there. https://github.com/Azure/azure-functions-host/blob/2b8c2b851e2a415d70b40e7d47bc415a0a82475a/src/WebJobs.Script/Environment/EnvironmentSettingNames.cs#L21

https://github.com/Azure/azure-functions-host/blob/4d404725772eb5652907e749737154553c84c126/src/WebJobs.Script.WebHost/Middleware/HttpThrottleMiddleware.cs#L31

So, going by https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#performance-targets, if you want to support, say, 1000 orchestrations per second, then at 5 per second for a Small instance you might need at least 100 instances if you're running on Large Premium instances. Does that sound right?
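The estimate above is a simple division, sketched below. It assumes throughput scales linearly with instance count, which the thread does not confirm, and `instances_needed` is a hypothetical helper. The 10-per-second figure in the second call is not from the documentation; it is only the per-instance rate that a 100-instance estimate would imply.

```python
import math


def instances_needed(target_per_sec: float, per_instance_per_sec: float) -> int:
    """Instances required to reach a target rate, assuming linear scaling."""
    if per_instance_per_sec <= 0:
        raise ValueError("per-instance rate must be positive")
    return math.ceil(target_per_sec / per_instance_per_sec)


# At 5 per second per instance (the figure quoted for a small instance):
print(instances_needed(1000, 5))   # 200

# A 100-instance estimate implies about 10 per second per instance (assumption):
print(instances_needed(1000, 10))  # 100
```

Real numbers will depend on the workload, so treat this as a starting point for load testing rather than a sizing rule.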

mpaul31 commented 5 years ago

@FinVamp1 any more details on this? I am trying to diagnose an issue where the CPU is spiking to 100% and not coming back down. Trying to figure out if this is related.

Also, right now I am not using a dedicated storage account for Durable Functions, and we are on a small App Service plan.

mpaul31 commented 5 years ago

@cgillum commenting off of what @FinVamp1 mentioned, is the document saying an A1 VM can only support 5 concurrent orchestrations at a time?

https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#performance-targets

cgillum commented 5 years ago

@mpaul31 The document is saying that you can expect a throughput of up to 5 activity functions per second on a single A1 VM running a single orchestration. The document isn't making any statement about orchestration concurrency on a single VM, except that you can configure your desired per-VM maximum concurrency through host.json settings.

mpaul31 commented 5 years ago

Hmmm, how would you recommend planning out your VM capacity? Unfortunately we are not able to use the Consumption plan at the moment. Also, let's assume a single VM. Would it make sense to increase the partition count above the default, or does that only come into play when scaling out VMs?

cgillum commented 5 years ago

I think testing will be required to determine the right VM capacity because the right number could vary quite a bit depending on the actual workload. One thing I can tell you, however, is that it's ideal to have a partition count greater than or equal to the VM count (having them be equal is the most optimal in terms of I/O costs).
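To illustrate why the partition count should be at least the VM count, here is a toy model that spreads partitions over VMs round-robin. The real extension balances partition ownership with its own lease mechanism, so the round-robin rule and both function names are assumptions made purely for illustration.

```python
def assign_partitions(partition_count: int, vm_count: int) -> dict:
    """Spread partitions over VMs round-robin (illustrative, not the real algorithm)."""
    assignment = {vm: [] for vm in range(vm_count)}
    for partition in range(partition_count):
        assignment[partition % vm_count].append(partition)
    return assignment


def vms_without_partitions(partition_count: int, vm_count: int) -> list:
    """VMs that own no partition and so would receive no orchestrator work."""
    assignment = assign_partitions(partition_count, vm_count)
    return [vm for vm, partitions in assignment.items() if not partitions]


# PartitionCount of 2 (as in the settings in this issue) on 4 VMs:
print(assign_partitions(2, 4))       # {0: [0], 1: [1], 2: [], 3: []}
print(vms_without_partitions(2, 4))  # [2, 3]

# Equal counts give every VM exactly one partition:
print(assign_partitions(4, 4))       # {0: [0], 1: [1], 2: [2], 3: [3]}
```

In this model, any VM beyond the partition count ends up owning no partition, which is the situation the guidance above avoids.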

Regarding high CPU, we have another issue tracking some high CPU issues that other customers have encountered. You may want to take a look at https://github.com/Azure/durabletask/issues/271.

mpaul31 commented 5 years ago

Hi Chris

Does the new 1.8.1 release contain the fix for the infinite loop with message ordering?

(Sent by email in reply to Chris Gillum's comment of Apr 8, 2019, 3:29 PM.)

cgillum commented 5 years ago

Yes it does. Sorry for forgetting to call that out in the release notes. It was fixed by this PR: https://github.com/Azure/azure-functions-durable-extension/pull/701

mpaul31 commented 5 years ago

OK no problem thanks man!

(Sent by email in reply to Chris Gillum's comment of May 4, 2019, 11:20 PM.)