Closed: Peter-B- closed this issue 3 months ago
I got a hint from the support team that I could try WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 2.
That seems to work and will limit the number of server instances.
That seems to reduce the number of servers somewhat (currently I see 7 instances).
+@cgillum who is more of an expert on this than I am. Chris, this does seem like a large overprovisioning (up to 30 instances for 4 partitions). Is this a known issue?
It's odd that the error you get from EventHubs is Microsoft.WindowsAzure.Storage.StorageException: Operation could not be completed within the specified time.
That seems like a timeout listing the blobs, rather than a Conflict that I would expect from a locked blob lease.
@brettsam The EventProcessorHost used by Event Hubs trigger will periodically do a list blobs operation to determine whether the leases are evenly distributed.
But yes, this does sound like major over-provisioning. Are there any other functions besides this one in the app? Also, was VNET or runtime-driven scaling configured for this app?
Also, @Peter-B- do you have a reference to the Microsoft support case that I can use to get more information about your app? I'd be interested to take a closer look at your case.
Hi @cgillum, Thanks for your support.
I guess the reference number would be 120090824005307.
There is one other function in this app, which is deactivated using the [Disable()] attribute. The event hub triggered function is the only one running.
There is no VNet configured.
I am not sure about the runtime-driven scaling. Where would I configure this?
Hi @Peter-B-, thanks for bringing this to our attention. I've identified some faulty partitioning logic that resulted in egregious over-scaling for your app. I'm working on a fix for this in our private repository now. In the meantime, I've mitigated the over-scaling by setting the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT app setting to the partition count (your app was showing a setting of "__ WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT __", which I assume was a mistake).
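For reference, a minimal sketch of how the mitigation described above (capping WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT at the partition count) might be applied from a script. It only assembles an Azure CLI invocation and does not run it. The helper name, the app name and the resource group are hypothetical placeholders, not values from this thread.

```python
from typing import List

SETTING_NAME = "WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT"


def build_scale_out_setting_command(
    app_name: str, resource_group: str, max_instances: int
) -> List[str]:
    """Assemble an `az functionapp config appsettings set` call.

    app_name and resource_group are placeholders supplied by the caller;
    max_instances would typically be the Event Hub partition count.
    """
    if max_instances < 1:
        raise ValueError("max_instances must be at least 1")
    return [
        "az", "functionapp", "config", "appsettings", "set",
        "--name", app_name,
        "--resource-group", resource_group,
        "--settings", f"{SETTING_NAME}={max_instances}",
    ]


if __name__ == "__main__":
    # Hypothetical names; 4 matches the partition count from the issue.
    command = build_scale_out_setting_command(
        "my-function-app", "my-resource-group", 4
    )
    print(" ".join(command))
```

The list can be passed to subprocess.run(command, check=True) once the Azure CLI is installed and logged in.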
Hi @glennamanns, thanks for addressing this issue.
The __ was intentional. As I wrote before, I set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 2, but that didn't fix the problem. However, I got the feeling that it improved the situation, i.e. I found only 6 to 8 instances instead of 15 to 25.
I "removed" the setting by applying __ in an attempt to verify that finding before posting it here. But I didn't find time yesterday to check it.
I'm glad to hear that you found the cause of the problem and I'm looking forward to hearing from you.
Hi,
I have similar observations with our deployment, where we see 20-22 server instances for an Event Hub with 5 partitions. Our support ticket with MSFT helped to clarify this, and I am including the ticket here for reference: Case Number 120091023001959. I will try the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT setting.
Hi, I am not sure if the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT setting works as expected. I have it set to 5, yet I can see 12 servers online and the count keeps going up.
Hi, Manual Stop and Start seems to have helped contain the scaling
All of a sudden, the number of server instances increased to 15. A manual stop and start doesn't help either. @glennamanns, any comments on this?
Hi, is there any news on this topic?
@Peter-B-, do you experience similar over-provisioning issues even with the recommended app setting?
Yes. This setting only mitigates the problem somewhat. It does not solve it.
Support told me that the AF team is working on a solution, but I haven't received any feedback here.
@Peter-B- and @alkreddy, the fix is scheduled to be rolled out in the upcoming platform release.
@glennamanns - Is there any mitigation to avoid the over scaling of the instance?
However, it seems we have an option to limit the scaling. Could you try the approach suggested in the article below? https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#limit-scale-out
@UnderShasha Is 'limit-scale-out' different than manually setting the App Configuration WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT?
@alkreddy - The functionality is the same, but the settings are different. Could you try that out and see whether it limits the scaling?
Hi @UnderShasha, I set properties.functionAppScaleLimit to 2 and checked over the last week. And I indeed never saw more than two instances at a time. I will keep an eye on it, but this really seems to solve the issue.
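A minimal sketch of how properties.functionAppScaleLimit might be set from a script, assuming the `az resource update` form described in the linked limit-scale-out article. The helper name, app name and resource group are hypothetical placeholders; the code only builds the argument list and does not call Azure.

```python
from typing import List


def build_scale_limit_command(
    app_name: str, resource_group: str, scale_limit: int
) -> List[str]:
    """Assemble an `az resource update` call for the site's config/web resource.

    app_name and resource_group are placeholders supplied by the caller.
    """
    if scale_limit < 1:
        raise ValueError("scale_limit must be at least 1")
    return [
        "az", "resource", "update",
        "--resource-type", "Microsoft.Web/sites",
        "-g", resource_group,
        "-n", f"{app_name}/config/web",
        "--set", f"properties.functionAppScaleLimit={scale_limit}",
    ]


if __name__ == "__main__":
    # Hypothetical names; 2 is the limit reported in the comment above.
    print(" ".join(build_scale_limit_command("my-function-app", "my-resource-group", 2)))
```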
@UnderShasha, same here. With properties.functionAppScaleLimit, the number of server instances stays within the configured limit.
@Peter-B- and @alkreddy - Good news. Thank you for the update.
@UnderShasha hello, from what I understand you're from Microsoft. Can you please let us know when this is going to be fixed at the infrastructure or library level? properties.functionAppScaleLimit does drastically reduce the number of list calls, but at the same time it defeats the very idea of automatic scalability, because the limit caps it. In my case that scalability is important for scenarios where the amount of streamed data is unknown. This is an essential part of the functionality and, as mentioned above, in the current state and without the cap, the over-provisioning can incur massive costs (list and other operations). Is there a timeline for a permanent solution to the problem that doesn't kill the flexibility of auto-scale?
@pragnagopa, could this be on the scale controller side?
Azure Functions on a consumption plan are automatically scaled. This can even lead to overprovisioning, as described in the docs. This means that there can be more Azure Function instances than Event Hub partitions. This is expected behavior.
However, I am experiencing a massive overprovisioning of up to 30 Azure Function Instances for 4 partitions. Those instances try to get a lease on a blob in the assigned storage account, which will fail for 26 of the 30 instances with a timeout.
These failed attempts to get a lease on the blobs cause significant costs for “lrs list and create container operations”, as well as for logged exceptions in Application Insights.
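The arithmetic behind the 26 failing instances can be sketched as follows. This assumes that each partition lease is held by exactly one instance at a time and that leases are spread so that as many instances as possible hold one; the function name is hypothetical.

```python
from typing import Tuple


def lease_outcome(instance_count: int, partition_count: int) -> Tuple[int, int]:
    """Return (instances holding a lease, instances left without one).

    Assumption: one lease per partition, held by a single instance at a time,
    and leases spread over as many instances as possible.
    """
    if instance_count < 0 or partition_count < 0:
        raise ValueError("counts must be non-negative")
    holders = min(instance_count, partition_count)
    without_lease = instance_count - holders
    return holders, without_lease


if __name__ == "__main__":
    # Numbers from the issue: 30 instances competing for 4 partitions.
    holders, without_lease = lease_outcome(30, 4)
    print(f"{holders} instances hold a lease, {without_lease} do not")
```

Every instance without a lease still polls the storage account, which is where the list operations and logged exceptions come from.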
I had a discussion with Shashank from Microsoft Support and he reviewed some internal logs. There he found that
• new instances are created, since messages queue up
• instances are removed, since there are more instances than partitions
With this information, I believe that I encounter some issue with the scaling logic here:
https://github.com/Azure/azure-functions-eventhubs-extension/blob/7ccd930e6d2fdde64ded0ec7f540b08627739619/src/Microsoft.Azure.WebJobs.Extensions.EventHubs/Listeners/EventHubsScaleMonitor.cs#L288
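To illustrate how the two signals reported by support could interact, here is a toy model. It is not the logic of EventHubsScaleMonitor or of the scale controller. It is an assumption-laden sketch in which every decision adds or removes exactly one instance, and all names are hypothetical.

```python
from enum import Enum
from typing import List


class Vote(Enum):
    SCALE_OUT = "scale_out"
    SCALE_IN = "scale_in"
    KEEP = "keep"


def toy_scale_vote(instance_count: int, partition_count: int, backlog: int) -> Vote:
    """Toy decision rule built only from the two signals named above."""
    if instance_count > partition_count:
        return Vote.SCALE_IN   # more instances than partitions
    if backlog > 0:
        return Vote.SCALE_OUT  # messages are queuing up
    return Vote.KEEP


def simulate(partition_count: int, backlog: int, steps: int) -> List[int]:
    """Apply the toy rule repeatedly, changing the count by one per step."""
    instances = 1
    history = [instances]
    for _ in range(steps):
        vote = toy_scale_vote(instances, partition_count, backlog)
        if vote is Vote.SCALE_OUT:
            instances += 1
        elif vote is Vote.SCALE_IN:
            instances -= 1
        history.append(instances)
    return history


if __name__ == "__main__":
    # 4 partitions and a constant backlog, as in the reported scenario.
    history = simulate(partition_count=4, backlog=100, steps=20)
    print(history, "max:", max(history))
```

Under these assumptions the instance count only oscillates between the partition count and one above it, so a count of 20 to 30 instances for 4 partitions cannot be explained by these two signals alone in this simplified model.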
Repro steps
• Create an IoT Hub with 4 partitions and send messages to it from multiple devices
• Create an Azure Function with an Event Hub Trigger
• Consume messages in batches and add some Task.Delay(), so that batches have more than one message.
• Let it run and check the number of “servers online“ in Application Insights.
Expected behavior
Since the IoT Hub has 4 partitions, I would expect to see 1-4 Azure Function instances. I would even expect a mild overprovisioning of maybe 6 instances.
Actual behavior
20 to 30 servers online and a lot of exceptions:
Known workarounds
Stop the Azure Function, wait until all instances are killed and restart it. Then the problem is gone for some time (hours to days), i.e. fewer than 8 instances are created.
Related information
"Microsoft.Azure.WebJobs.Extensions.EventHubs" Version="4.1.1"
"Microsoft.NET.Sdk.Functions" Version="3.0.9"