Closed icecog closed 2 years ago
I increased the app service plan to 16 nodes and removed the deployment slot but to no avail.
Let me also add that, though I doubt it has any impact, we're receiving between 1,000 and 3,000 messages per second across all partitions.
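Back-of-envelope check that this throughput should be well within the configured batch settings, assuming messages are spread roughly evenly across the 16 partitions:

```python
# Rough per-partition load estimate, assuming even distribution.
# Rates are the range reported above; 16 is the partition count.
total_rates = (1000, 3000)  # messages/sec across all partitions
partitions = 16
per_partition = [r / partitions for r in total_rates]
print(per_partition)  # [62.5, 187.5]
```

Even at the top of the range, each partition sees under 200 messages/s, so a `maxBatchSize` of 100 with `prefetchCount` 200 should not be the bottleneck.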
And here is the host.json config
```json
{
  "extensions": {
    "eventHubs": {
      "batchCheckpointFrequency": 100,
      "eventProcessorOptions": {
        "maxBatchSize": 100,
        "prefetchCount": 200
      }
    }
  },
  "version": "2.0"
}
```
But I've tried everything short of setting prefetchCount to 0.
It turns out it was the damned staging slot that was taking up the partition leases. I'm not sure how I'll solve it, but the first step is to avoid using an ARM template from 2015 to provision it... Even after removing the slot entirely the problem persisted for a little while - or it may have been my imagination. But eventually the production slot started firing on all cylinders.
Nope, I was wrong, it's not the staging slot. :/ Could it be that a function can shut down without releasing the lease it holds? And could we fix this by specifying a shorter lease duration?
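For reference, the `EventProcessorHost` that the legacy Event Hubs extension is built on exposes lease tuning via `PartitionManagerOptions` (`LeaseDuration`, `RenewInterval`). I'm not certain the Functions host actually plumbs these through to host.json in every extension version, so treat this as a sketch to verify against your version rather than a documented fix:

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "batchCheckpointFrequency": 100,
      "eventProcessorOptions": {
        "maxBatchSize": 100,
        "prefetchCount": 200
      },
      "partitionManagerOptions": {
        "leaseDuration": "00:00:20",
        "renewInterval": "00:00:10"
      }
    }
  }
}
```

A shorter lease means an orphaned lease (e.g. one held by a host that died without releasing it) expires and gets picked up sooner, at the cost of more frequent renewal traffic to the checkpoint storage account.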
@icecog, are you still experiencing the issue?
I no longer work there and so have no idea if this issue persists.
Feel free to take whatever action you consider best with this issue
Closing as no longer relevant.
Hi,
I've been having this issue for a while now and I cannot figure it out - I'd like to say I've tried everything, but I hope not.
The problem is that the Azure Function consistently starves 2 partitions: a lease is taken and then they are just left there until the lease breaks (an hour or so later). Then two other partitions are left to be starved. Whenever I redeploy the function it grabs all of them, but soon starts to ignore two of them again. It takes about 2 minutes for the problem to start.
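One way to spot this starvation without waiting for the delay graphs is to scan the last-checkpoint time of each partition and flag any that have stopped advancing. The data below is hand-made for illustration; in practice you would read it from the checkpoint blobs in the storage container the extension uses:

```python
from datetime import datetime, timedelta, timezone

def find_starved(checkpoints, now, threshold=timedelta(minutes=5)):
    """Return partition ids whose last checkpoint is older than threshold."""
    return sorted(pid for pid, last in checkpoints.items() if now - last > threshold)

# Hypothetical snapshot: partitions 3 and 11 have not checkpointed for an hour,
# matching the "two starved partitions" pattern described above.
now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
checkpoints = {pid: now - timedelta(seconds=30) for pid in range(16)}
checkpoints[3] = now - timedelta(hours=1)
checkpoints[11] = now - timedelta(hours=1)
print(find_starved(checkpoints, now))  # [3, 11]
```

Running this on a schedule against the real checkpoint store would show whether it is always the same two partitions, or whether the starved pair rotates as the leases break.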
I'm running an Azure Function (v3 - project config below) that reads from an event hub with 16 partitions. It runs on an App Service Plan with around 11 nodes (plenty of CPU to spare).
And I have a deployment slot. This is probably the culprit, as it sometimes doesn't stop consuming the event hub after swapping... but I'm not sure, since the problem keeps happening even if I stop it (90% sure; I'll double-check). (I deployed the slot using an ARM template - maybe something in there is messing things up? I'm using Terraform for that, which is why it looks weird - posted at the bottom.)
I've redeployed the function in question, messed with every setting (all three of them) in host.json, and even cycled the machines on the App Service Plan (set scale to 1 and then back to 11). I've even tried creating a new consumer group, but the same problem remains.
And this is beyond just upgrading to the latest NuGet packages and looking for anything in my code - luckily I have a second function that doesn't show this issue to compare against, but no luck. So my guess is that something in the extension is causing this.
Like I said, there are two functions that read the exact same data, but this problem occurs to a significant degree on only one of them. The other may have the same problem, but to a much smaller extent for some reason. They both have almost the same setup: output clients for Event Grid and Redis Cache (not provided by the Azure Functions runtime). The only real difference is that the problematic function also integrates with another event hub, which I manage myself through an EventHubClient.
Below are pictures of the event hub checkpoints, as well as a graph of the delay from when a message is enqueued on the event hub.
I'm running the following (with some internal packages removed)