Excessive overprovisioning with Event Hub Trigger

Peter-B- commented 4 years ago

Azure Functions on a Consumption plan are scaled automatically. As described in the docs, this can lead to overprovisioning, meaning that there can be more Azure Function instances than Event Hub partitions. This is expected behavior.

However, I am experiencing a massive overprovisioning of up to 30 Azure Function Instances for 4 partitions. Those instances try to get a lease on a blob in the assigned storage account, which will fail for 26 of the 30 instances with a timeout.

These failed attempts to get a lease on the blobs cause significant costs for “lrs list and create container operations”, as well as for logged exceptions in Application Insights.

I had a discussion with Shashank from Microsoft Support and he reviewed some internal logs. There he found that
• new instances are created, since messages queue up
• instances are removed, since there are more instances than partitions

With this information, I believe that I am running into an issue with the scaling logic here:

https://github.com/Azure/azure-functions-eventhubs-extension/blob/7ccd930e6d2fdde64ded0ec7f540b08627739619/src/Microsoft.Azure.WebJobs.Extensions.EventHubs/Listeners/EventHubsScaleMonitor.cs#L288
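The two signals from the internal logs pull in opposite directions, which a toy model makes easier to see. The Python sketch below is only an illustration under assumptions: `scale_vote`, its parameters, and the order of the checks are hypothetical and are not taken from EventHubsScaleMonitor.cs.

```python
from enum import Enum
from typing import Sequence


class Vote(Enum):
    SCALE_IN = -1
    NONE = 0
    SCALE_OUT = 1


def scale_vote(
    instance_count: int, partition_count: int, backlog_samples: Sequence[int]
) -> Vote:
    """Toy scale decision combining the two signals described above.

    Hypothetical logic, not the extension's real implementation.
    """
    # More instances than partitions: some instances can never own a lease,
    # so vote to remove one.
    if instance_count > partition_count:
        return Vote.SCALE_IN
    # A backlog that grows from sample to sample means messages queue up,
    # so vote to add an instance.
    growing = len(backlog_samples) >= 2 and all(
        earlier < later
        for earlier, later in zip(backlog_samples, backlog_samples[1:])
    )
    if growing:
        return Vote.SCALE_OUT
    return Vote.NONE


# With 4 partitions, a growing backlog adds instances only below the cap.
print(scale_vote(3, 4, [100, 250, 400]))   # Vote.SCALE_OUT
print(scale_vote(30, 4, [100, 250, 400]))  # Vote.SCALE_IN
```

In a model like this, the instance count settles at or below the partition count, which is why 20 to 30 instances for 4 partitions looks like a defect rather than the mild overprovisioning the docs describe.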

Repro steps

Create an IoT Hub with 4 partitions and send messages to it from multiple devices
Create an Azure Function with an Event Hub Trigger. Consume messages in batches and add some Task.Delay(), so that batches have more than one message.
Let it run and check the number of “servers online” in Application Insights.

Expected behavior

Since the IoT Hub has 4 partitions, I would expect to see 1-4 Azure Function instances. I would even expect a mild overprovisioning of maybe 6 instances.

Actual behavior

20 to 30 servers online and a lot of exceptions:

Microsoft.WindowsAzure.Storage.StorageException: Operation could not be completed within the specified time.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at Microsoft.WindowsAzure.Storage.Blob.CloudBlobContainer.ListBlobsSegmentedAsync(String prefix, Boolean useFlatBlobListing, BlobListingDetails blobListingDetails, Nullable`1 maxResults, BlobContinuationToken currentToken, BlobRequestOptions options, OperationContext operationContext, CancellationToken cancellationToken)
   at Microsoft.Azure.EventHubs.Processor.AzureStorageCheckpointLeaseManager.GetAllLeasesAsync()
   at Microsoft.Azure.EventHubs.Processor.PartitionManager.RunLoopAsync(CancellationToken cancellationToken)

Known workarounds

Stop the Azure Function, wait until all instances are killed, and restart it. Then the problem is gone for some time (hours to days), i.e. fewer than 8 instances are created.

Related information

.Net Core 3.1
Azure Function Version: V1
"Microsoft.Azure.WebJobs.Extensions.EventHubs" Version="4.1.1"
"Microsoft.NET.Sdk.Functions" Version="3.0.9"

Peter-B- commented 4 years ago

I got a hint from the support team that I could try WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 2.

~~That seems to work and will limit the number of server instances.~~

That seems to reduce the number of servers somewhat (currently I see 7 instances).

brettsam commented 4 years ago

+@cgillum who is more of an expert on this than I am. Chris, this does seem like a large overprovisioning (up to 30 instances for 4 partitions). Is this a known issue?

It's odd that the error you get from EventHubs is Microsoft.WindowsAzure.Storage.StorageException: Operation could not be completed within the specified time. That seems like a timeout listing the blobs, rather than a Conflict that I would expect from a locked blob lease.

cgillum commented 4 years ago

@brettsam The EventProcessorHost used by Event Hubs trigger will periodically do a list blobs operation to determine whether the leases are evenly distributed.
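As a rough illustration of that check, here is a minimal Python sketch. It assumes the list operation yields a partition-to-owner mapping; the function name `lease_distribution`, the host names, and the "counts differ by at most one" balance rule are assumptions for illustration, not the EventProcessorHost implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def lease_distribution(
    leases: Dict[str, Optional[str]], hosts: Iterable[str]
) -> Tuple[Dict[str, int], bool]:
    """Count leases per host and report whether they are evenly spread.

    `leases` maps partition id -> owning host (None if unowned).
    Hypothetical helper; "even" here means counts differ by at most one.
    """
    counts = Counter({host: 0 for host in hosts})
    for owner in leases.values():
        if owner is not None:
            counts[owner] += 1
    balanced = max(counts.values()) - min(counts.values()) <= 1
    return dict(counts), balanced


# 4 partitions, 30 hosts: at most 4 hosts can own a lease. The other 26
# still list the blobs on every check but have nothing to process.
hosts = [f"host-{i}" for i in range(30)]
leases = {"0": "host-0", "1": "host-1", "2": "host-2", "3": "host-3"}
counts, balanced = lease_distribution(leases, hosts)
idle = sum(1 for n in counts.values() if n == 0)
print(idle, balanced)  # 26 True
```

Under these assumptions every host, including the idle ones, pays for the list call, which would be consistent with the storage costs and list-blob timeouts reported above.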

But yes, this does sound like major over-provisioning. Are there any other functions besides this one in the app? Also, was VNET or runtime-driven scaling configured for this app?

cgillum commented 4 years ago

Also, @Peter-B- do you have a reference to the Microsoft support case that I can use to get more information about your app? I'd be interested to take a closer look at your case.

Peter-B- commented 4 years ago

Hi @cgillum, Thanks for your support.

I guess the reference number would be 120090824005307.

There is one other function in this app, which is deactivated using the [Disable()]-attribute. The event hub triggered function is the only one running.

There is no VNet configured.

I am not sure about the runtime-driven scaling. Where would I configure this?

glennamanns commented 4 years ago

Hi @Peter-B-, thanks for bringing this to our attention. I've identified some faulty partitioning logic that resulted in egregious over-scaling for your app. I'm working on a fix for this in our private repository now. In the meantime, I've mitigated the over-scaling by setting the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT app setting to the partition count (your app was showing a setting of "__ WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT __", which I assume was a mistake).

Peter-B- commented 4 years ago

Hi @glennamanns, thanks for addressing this issue.

The __ was intentional. As I wrote before, I set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 2, but that didn't fix the problem. However, I got the feeling that it improved the situation, i.e. I found only 6 to 8 instances instead of 15 to 25.

I "removed" the setting by applying __ in an attempt to verify that finding before posting it here. But I didn't find time yesterday to check it.

I'm glad to hear that you found the cause of the problem and I'm looking forward to hearing from you.

alkreddy commented 4 years ago

Hi,

I have similar observations with our deployment, where we see 20-22 server instances for an EH with 5 partitions. Our support ticket with MSFT helped to get clarification, and I am adding the ticket here as a reference: Case Number 120091023001959. I will try the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT setting.

alkreddy commented 4 years ago

Hi, I am not sure if the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT setting works as expected. I have this set to 5, and I can see 12 servers online and the count keeps going up.

alkreddy commented 4 years ago

Hi, a manual stop and start seems to have helped contain the scaling.

alkreddy commented 4 years ago

All of a sudden, server instances increased to 15. A manual stop and start doesn't help either. @glennamanns, any comments on this?

Peter-B- commented 4 years ago

Hi, is there any news on this topic?

alkreddy commented 4 years ago

@Peter-B-, do you experience similar issues with overprovisioning even with the recommended app setting?

Peter-B- commented 4 years ago

Yes. This setting only mitigates the problem somewhat. It does not solve it.

Support told me that the AF team is working on a solution, but I didn't get any feedback here.

UnderShasha commented 4 years ago

@Peter-B- and @alkreddy, the fix is scheduled to be rolled out in the upcoming platform release.

@glennamanns - Is there any mitigation to avoid the over-scaling of the instances?

However, it seems we have an option to limit the scaling. Could you try to implement it as suggested in the article below? https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#limit-scale-out

alkreddy commented 4 years ago

@UnderShasha Is 'limit-scale-out' different from manually setting the app setting WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT?

UnderShasha commented 4 years ago

@alkreddy - The functionality is the same, but the settings are different. Could you try that out and see if you can limit the scaling?

Peter-B- commented 4 years ago

Hi @UnderShasha, I set the properties.functionAppScaleLimit to 2 and checked over the last week. And I indeed never saw more than two instances at a time. I will keep an eye on it, but this really seems to solve the issue.
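For reference, the limit-scale-out article linked above sets this property through the Azure CLI. The Python sketch below only builds that command line so it can be reviewed before running. The helper name `scale_limit_command` and the resource group and app names are placeholders, and the `az resource update` arguments are reproduced from memory of that article, so treat them as an assumption and check them against the docs.

```python
import shlex
from typing import List


def scale_limit_command(resource_group: str, app_name: str, limit: int) -> List[str]:
    """Build (but do not run) an Azure CLI call that sets functionAppScaleLimit."""
    if limit < 1:
        raise ValueError("scale limit must be at least 1")
    return [
        "az", "resource", "update",
        "--resource-type", "Microsoft.Web/sites",
        "-g", resource_group,
        "-n", f"{app_name}/config/web",
        "--set", f"properties.functionAppScaleLimit={limit}",
    ]


# Placeholder names; pass the list to subprocess.run(...) to execute it.
cmd = scale_limit_command("my-resource-group", "my-function-app", 2)
print(shlex.join(cmd))
```

Building the argument list separately from executing it keeps the sketch free of side effects and makes the limit easy to derive from the partition count.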

alkreddy commented 4 years ago

@UnderShasha, same here. With properties.functionAppScaleLimit, the server scaling is limited to the property value.

UnderShasha commented 4 years ago

@Peter-B- and @alkreddy - Good news. Thank you for the update.

zorge commented 3 years ago

@UnderShasha hello, from what I understand you're from Microsoft. Can you please let us know when this is going to be fixed at the infrastructure or library level? The properties.functionAppScaleLimit setting does reduce the number of list calls drastically, but at the same time it kills the very idea of automatic scalability. That matters in my case, for scenarios where an unknown amount of data is streamed, because the limit caps that scalability. This is an essential part of the functionality, and as mentioned above, in the current state and without the cap, massive costs can be incurred from excessive overprovisioning (list and other operations). Is there a timeline for a permanent solution to the problem that doesn't kill the flexibility of auto-scale?

alrod commented 2 years ago

@pragnagopa, could this be on the scale controller side?
