Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Durable function performance & scale #2951

Open dxynnez opened 3 weeks ago

dxynnez commented 3 weeks ago

Description

Hello team,

We recently conducted a stress test for the durable function fan-out & fan-in pattern and discovered something interesting.

The function code can be found at: https://github.com/dxynnez/DurableFunctionStressTest/blob/main/DurableFunctionStressTest/Function.cs

So there is a timer that triggers 800 orchestrators every 10 minutes; each orchestrator triggers 1 activity function, then fans out to 100 sub-orchestrators and fans back in; each sub-orchestrator triggers 1 activity function, then fans out to 2 activities and fans back in. Conceptually, that means ~80,000 orchestrators and ~240,000 activities are triggered every 10 minutes.
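Roughly, the shape of the test is as below (a simplified sketch for reference, not the exact code in the linked Function.cs; the function names and the timer schedule here are illustrative):

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public static class StressTestFunctions
{
    // Timer fires every 10 minutes and starts 800 top-level orchestrations.
    [Function(nameof(StartOrchestrations))]
    public static async Task StartOrchestrations(
        [TimerTrigger("0 */10 * * * *")] TimerInfo timer,
        [DurableClient] DurableTaskClient client)
    {
        for (int i = 0; i < 800; i++)
        {
            await client.ScheduleNewOrchestrationInstanceAsync(nameof(ParentOrchestrator));
        }
    }

    // Parent orchestrator: 1 activity, then fan-out to 100 sub-orchestrators, then fan-in.
    [Function(nameof(ParentOrchestrator))]
    public static async Task ParentOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        await context.CallActivityAsync(nameof(DoWork), "parent");
        var subOrchestrations = Enumerable.Range(0, 100)
            .Select(_ => context.CallSubOrchestratorAsync(nameof(SubOrchestrator)));
        await Task.WhenAll(subOrchestrations);
    }

    // Sub-orchestrator: 1 activity, then fan-out to 2 activities, then fan-in.
    [Function(nameof(SubOrchestrator))]
    public static async Task SubOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        await context.CallActivityAsync(nameof(DoWork), "sub");
        await Task.WhenAll(
            context.CallActivityAsync(nameof(DoWork), "fan-out-1"),
            context.CallActivityAsync(nameof(DoWork), "fan-out-2"));
    }

    // Activity: deliberately cheap, non-CPU-bound work (it just does Task.Delay(100)).
    [Function(nameof(DoWork))]
    public static async Task DoWork([ActivityTrigger] string input)
    {
        await Task.Delay(100);
    }
}
```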

The function app was originally running on a dedicated ASP with 3 I1V2 workers. The above load resulted in a huge backlog, with the CPU on all workers constantly hitting > 90%. 1) We then scaled the ASP out to 12 workers, and that seems to handle the load just fine, although it's a bit slow: the last orchestrator took around 3 minutes to complete. CPU usage on all workers is still relatively high (around 75%). 2) We then scaled the ASP in to 6 workers and up to I2V2. An I2V2 (4 vCores) is about twice as powerful as an I1V2 (2 vCores), so we expected the system to behave the same as in 1). But to our surprise, although this reduced the CPU usage on each worker to around 50%, the execution time increased dramatically: the slowest orchestrator took around 7 minutes to complete.

We further analyzed our telemetry, and setup 1) completes about twice as many activities as setup 2) for almost every minute within the 10-minute interval (before all activities complete, of course), and hence completes twice as many orchestrators, roughly twice as fast.

It looks like some configuration is preventing our function app from fully utilizing its compute power. But the concurrency levels for activities (maxConcurrentActivityFunctions) and orchestrators (maxConcurrentOrchestratorFunctions) are tied to the processor count (I2V2 has 4 cores while I1V2 only has 2), so in setup 2) each worker should be able to run more functions concurrently than in setup 1) - what other setting(s) might be causing the above behavior, then?

Expected behavior

A function app running orchestrators should be able to fully utilize its compute power.

Relevant source code snippets

https://github.com/dxynnez/DurableFunctionStressTest/blob/main/DurableFunctionStressTest/Function.cs

cgillum commented 3 weeks ago

You're basically comparing two different setups, each with 24 cores. In setup (1), those 24 cores are distributed across 12 workers (2 cores per worker). In setup (2), those 24 cores are distributed across 6 workers (4 cores per worker). And you're finding that the 12-worker setup completes the workload faster than the 6-worker setup. Is that accurate?

But the concurrency levels for activities (maxConcurrentActivityFunctions) and orchestrators (maxConcurrentOrchestratorFunctions) are tied to the processor count (I2V2 has 4 cores while I1V2 only has 2), so in setup 2) each worker should be able to run more functions concurrently than in setup 1)

In dedicated plans, the default concurrency value is 10x the number of cores on the VM (using the value of Environment.ProcessorCount). So for 2-core machines the concurrency should be 20, and for 4-core machines the concurrency should be 40. However, between (1) and (2), you've cut the number of machines by half. With 12 dual-core machines, you have a global concurrency of 240 (12 × 2 × 10), and with 6 quad-core machines, you also have a global concurrency of 240 (6 × 4 × 10). From that perspective, I'd expect the results should be about the same.
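If you want to take core count out of the picture, these per-worker limits can also be pinned explicitly in host.json rather than relying on the processor-count-based defaults. A minimal sketch (the values below are illustrative, not recommendations):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 20,
      "maxConcurrentOrchestratorFunctions": 20
    }
  }
}
```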

I'm not sure why you'd see (2) being slower than (1), though. That feels counter-intuitive since (2) has more CPU headroom, unless there's some other I/O resource contention that I'm not aware of. If you turn on Durable Task Framework logging, are you able to see any warnings or errors that indicate whether something is going wrong?

dxynnez commented 3 weeks ago

Thanks @cgillum! The math is right on point! I can't seem to get any logs from the durable framework, though - are they emitted by the host or the worker process? We don't use Application Insights; if they are emitted by the host (and hence have nothing to do with the ILogger we injected in the worker process), is there any other way to collect them, other than using Application Insights?

cgillum commented 3 weeks ago

I can't seem to get any logs from the durable framework - are they emitted by the host or the worker process?

They are emitted by the host process, so it would be tricky to see them if you're not using Application Insights or some other built-in mechanism for collecting logs. Log streaming is one option.

Another option is to use the Diagnose and Solve blade from the Azure Portal. We have some tools that can perform automated diagnostics specifically for Durable Functions. More information here: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-troubleshooting-guide
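One more option, if you just want the Durable Task Framework warnings and errors to show up somewhere you can stream or collect: raise the relevant logging categories in host.json. A sketch (the category names are the ones used by the Azure Storage backend per the Durable Functions diagnostics docs; treat the exact levels as illustrative):

```json
{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "DurableTask.AzureStorage": "Warning",
      "DurableTask.Core": "Warning"
    }
  }
}
```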

dxynnez commented 2 weeks ago

@cgillum I turned on Application Insights but I don't see any warning or error traces (I do see information-level traces, so the framework does emit logs to Application Insights).

Any other ways to pinpoint the problem?

dxynnez commented 2 weeks ago

I discovered something more interesting... After reading some of the docs I thought this might have something to do with the control queue, so I doubled the 'controlQueueBufferThreshold' (from 256 to 512) - but that didn't seem to impact performance at all. I then ran the durable function detector and noticed that, after I switched to the second setup (from 12 I1V2 workers to 6 I2V2 workers): 1) the average control queue message age dropped by an order of magnitude, across all 4 partitions (even without me increasing the controlQueueBufferThreshold); 2) the average work-item queue message age increased by an order of magnitude.

Does any of this make sense?

Also, I don't fully understand how partitioning works with durable functions + Azure Storage, but it looks like the default 'partitionCount' is set to 4 - does that mean only 4 workers can handle the work concurrently? If so, that would be even more confusing: how would 12 workers perform better than 6 workers then...
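For reference, here's where those two knobs live in host.json with the Azure Storage provider (Durable Functions 2.x); a sketch showing the default partition count and the buffer threshold value I tried:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "partitionCount": 4,
        "controlQueueBufferThreshold": 512
      }
    }
  }
}
```

(My understanding is that partitionCount only takes effect when a task hub is created, so it can't be changed for an existing task hub.)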

cgillum commented 2 weeks ago

Also, I don't fully understand how partitioning works with durable functions + Azure Storage

There is some good information on how this works in this article: https://learn.microsoft.com/azure/azure-functions/durable/durable-functions-azure-storage-provider#orchestrator-scale-out

TL;DR: only 4 workers can handle orchestrator tasks (which come from the control queues), but all workers, regardless of partition count, can handle activity tasks (which come from the single work-item queue).

If you're seeing control queue message age being lower in (1), it's likely because the activity tasks can be distributed across more workers, leaving more CPU cycles available on the workers that manage the control queue partitions. I think this may be especially relevant if the activities you're running are potentially CPU intensive.

dxynnez commented 2 weeks ago

Thanks @cgillum. Actually the control queue message age is lower in (2), by at least an order of magnitude, and (2) has more CPU headroom. The work-item queue message age is higher in (2), by an order of magnitude - I assume this means that (2) can process more orchestrator tasks but fewer activity tasks.

The activities shouldn't be CPU intensive - they are just doing a Task.Delay(100).

cgillum commented 2 weeks ago

Oh, my mistake - I misread and misremembered. 😅 If the activities are just doing a Task.Delay(100), then they'll have very little CPU overhead.

I'd expect the majority of the CPU usage to be focused on control-queue message processing. If control queue ages are lower in (2), then it's probably because each of the four partition-owning workers has more CPUs to process their respective control queues. This extra per-worker CPU power translates to lower control queue ages.

dxynnez commented 2 weeks ago

@cgillum thanks - now I understand that the control queue is for the orchestrators themselves, so the control queue behavior makes sense.

But I don't understand why the work-item queue ages are much higher with the 2) setup - per my understanding, activity functions can be evenly distributed across all the workers, since there is no 'partition' involved. The overall concurrency for activity functions should then be the same for both 1) and 2), which in theory should result in the same work-item queue age, while in fact the 2) setup has a 10x higher work-item queue age compared to 1).

cgillum commented 2 weeks ago

But I don't understand why the work-item queue ages are much higher with the 2) setup

This one is harder to say. I think we'd need a bit more data, for example, to see whether all activity invocations are being evenly distributed across all workers.

dxynnez commented 2 weeks ago

@cgillum

Yes they are being evenly distributed across all workers, per our telemetry.

Is there any detector that shows the average number of work items in the queue for a particular task hub instance?

On I2V2, it looks like no matter how high I set maxConcurrentActivityFunctions (we are talking about thousands here, while the default is 10 * 4 cores), the number of activity functions completed per minute on a single worker is always capped at around 6k. That does align with the performance target described here, though - but that target was measured on an A1 VM (single core), while we are using I2V2 (4 cores).

It's either that there aren't enough activity tasks in the queue (less likely, given that the messages in the work-item queue have a higher age, which should indicate an activity backlog), or something is preventing the worker from running 'maxConcurrentActivityFunctions' activities concurrently - is there any host-level setting that limits the maximum concurrency for all types of functions, or some queue-related setting that limits the number of work-item messages that can be dequeued at a time, such that the host process just can't feed enough concurrent work items to a single worker process?