Azure / azure-functions-kafka-extension

Kafka extension for Azure Functions

Scale monitor not working #406

Open a99cl208 opened 1 year ago

a99cl208 commented 1 year ago

Hello,

I am experiencing an issue with the autoscaling of the Kafka function.

In this line: https://github.com/Azure/azure-functions-kafka-extension/blob/dev/src/Microsoft.Azure.WebJobs.Extensions.Kafka/Listeners/KafkaTopicScaler.cs#L101 you call the Kafka server to get the latest offset of each partition. The problem is that this call takes 250ms on average in our environments, and it is done sequentially, partition after partition. On top of that, if there are multiple functions in the same Function App, their monitors are also evaluated sequentially. In my case I have 13 functions, 1 topic per function, with 100 partitions per topic, so the evaluation of the metrics takes 13 * 100 * 0.250 = 325 seconds (i.e. about 5.4 min).

In addition, by default the scale monitor only takes the samples from the last 2 minutes for the scaling evaluation, which in my case means either 0 or 1 sample. But then in this line: https://github.com/Azure/azure-functions-kafka-extension/blob/dev/src/Microsoft.Azure.WebJobs.Extensions.Kafka/Listeners/KafkaTopicScaler.cs#L154 you skip the scaling decision if there are not at least 5 samples... So in my case the scaling is just not working at all.
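For reference, the pattern behind that first link boils down to something like this (a simplified sketch, not the extension's actual code; consumer, topicName, partitions and committed stand in for the state the real KafkaTopicScaler holds, while QueryWatermarkOffsets is the real Confluent.Kafka call):

long totalLag = 0;
foreach (var partition in partitions)   // e.g. 100 partitions per topic
{
    // Blocking broker round-trip (~250 ms in our environments),
    // executed strictly one partition after another.
    WatermarkOffsets wm = consumer.QueryWatermarkOffsets(
        new TopicPartition(topicName, partition), TimeSpan.FromSeconds(5));
    totalLag += Math.Max(0, wm.High.Value - committed[partition]);
}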

If I take the opposite maths, you need at least 5 samples in 2 minutes (I am assuming no fine tuning of the job host options to make this work). There are 10 seconds of wait time between 2 samplings by default, so that leaves 120 - 5*10 = 70 seconds to compute the 5 samples (I am not counting the compute time). So there are only 70/5 = 14 seconds allowed per sample. With 250ms per partition, that leaves room for at most 56 partitions queried (so 4 topics of 14 partitions, or 14 topics of 4 partitions, or any combination that leads to 56 partitions), which is not viable for big projects.
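Writing the same budget out as code makes the limit easy to check (a sketch only, using the default values quoted above and .NET's TimeSpan operators):

var window         = TimeSpan.FromMinutes(2);    // default scale-metrics max age
var sampleInterval = TimeSpan.FromSeconds(10);   // default wait between samplings
const int minSamples = 5;                        // minimum required by the scaler

var timeLeft      = window - minSamples * sampleInterval;  // 120 - 5*10 = 70 s
var perSample     = timeLeft / minSamples;                 // 70 / 5 = 14 s
var perPartition  = TimeSpan.FromMilliseconds(250);        // observed query cost
var maxPartitions = (int)(perSample / perPartition);       // 14 / 0.25 = 56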

So either the QueryWatermarkOffsets method is not supposed to take 250ms but more like 10ms, or there is a big design limitation in the current code. I would suggest either making the partition foreach loop parallel, or moving to another method on the Confluent Kafka client.
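For the parallel option, a minimal sketch of what the loop could look like, assuming the client handle tolerates concurrent watermark queries (if it does not, one consumer per task would be needed; consumer, topicName and partitions are stand-ins as before):

var queryTimeout = TimeSpan.FromSeconds(5);
// QueryWatermarkOffsets is synchronous, so fan it out on the thread pool.
var tasks = partitions.Select(partition => Task.Run(() =>
    consumer.QueryWatermarkOffsets(new TopicPartition(topicName, partition), queryTimeout)));
WatermarkOffsets[] watermarks = await Task.WhenAll(tasks);
// 100 queries now cost roughly one round-trip (bounded by thread-pool width)
// instead of 100 sequential ones.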

As a comparison, with the Azure Event Hubs trigger I get between 4 and 5 samples per minute.

a99cl208 commented 1 year ago

If someone is experiencing the same issue, what I did as a temporary fix is to change the scaler options so it uses the samples from the last 30 minutes instead of the last 2, to get at least 5 samples. However, this is not really intended to be configurable, so I had to use a bit of ugly reflection code in the service configuration:

// Requires: using System; using System.Reflection; using Microsoft.Extensions.DependencyInjection;
// Grab the internal ScaleOptions type from the host assembly at runtime.
var scaleOptionsType = Type.GetType("Microsoft.Azure.WebJobs.Script.ScaleOptions, Microsoft.Azure.WebJobs.Script");
if (scaleOptionsType != null)
{
    // Close the generic helper below over the internal type and invoke it.
    GetType().GetMethod(nameof(ConfigureScaleOptions), BindingFlags.NonPublic | BindingFlags.Static)!
        .MakeGenericMethod(scaleOptionsType)
        .Invoke(null, new object[] { services });
}

Plus adding this method:

private static void ConfigureScaleOptions<T>(IServiceCollection services) where T : class
{
    // Push the private _scaleMetricsMaxAge field from the default 2 minutes up to 30.
    services.PostConfigure<T>(x => x.GetType()
        .GetField("_scaleMetricsMaxAge", BindingFlags.NonPublic | BindingFlags.Instance)!
        .SetValue(x, TimeSpan.FromMinutes(30)));
}

While it works, the main problem with this method is that the time it takes to scale out is limited by the time to get a sample. So in my case it is 5 minutes, but it can be longer if you have more partitions to process. If that is not acceptable, another solution that might be viable (but that I have not tested) is to replace the IScaleMonitorManager registration from the SDK with a custom one. Use the same code as the SDK implementation (here: https://github.com/Azure/azure-webjobs-sdk/blob/c926130a942794286940c91b75247ca8843245f5/src/Microsoft.Azure.WebJobs.Host/Scale/ScaleMonitorManager.cs), but in the Register method check whether the monitor is a KafkaTopicScaler instance, and if so replace it with an implementation of your own that does not suffer from the performance issue. A sketch of what that could look like follows.
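Something along these lines, perhaps (untested; IScaleMonitorManager, Register and GetMonitors come from the linked SDK file, while ParallelKafkaTopicScaler is a hypothetical replacement you would have to write yourself; the real manager also collects monitors from IScaleMonitorProvider registrations, which is omitted here):

using System.Collections.Generic;
using Microsoft.Azure.WebJobs.Host.Scale;

internal sealed class KafkaAwareScaleMonitorManager : IScaleMonitorManager
{
    private readonly List<IScaleMonitor> _monitors = new List<IScaleMonitor>();

    public void Register(IScaleMonitor monitor)
    {
        // KafkaTopicScaler<TKey, TValue> is generic, so match on the open type name.
        if (monitor.GetType().Name.StartsWith("KafkaTopicScaler"))
        {
            monitor = new ParallelKafkaTopicScaler(monitor); // hypothetical faster scaler
        }
        lock (_monitors)
        {
            _monitors.Add(monitor);
        }
    }

    public IEnumerable<IScaleMonitor> GetMonitors()
    {
        lock (_monitors)
        {
            return _monitors.ToArray();
        }
    }
}

It would then be swapped in with something like services.Replace(ServiceDescriptor.Singleton<IScaleMonitorManager, KafkaAwareScaleMonitorManager>()), assuming the default registration is a plain singleton.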