Azure Functions not dequeing messages under load

asalvo commented 6 years ago

I have a function app running 1.0.13 of the .Net Functions SDK, deployed to a US North Central consumption based plan, configured for 2.0 beta. Under high loads, messages which have been dequeued more then the configured maxDequeueCount are not being moved to the poison queue. When I finally decided to stop my function so I could observe the queue, the first 32 messages had dequeue counts of 65-69 while the configured maxDequeueCount was 3.

When I first deployed my function, I had a bug, and observed the message getting moved to the poison queue as expected. After fixing that bug and running a few more tests with 1-10 messages, I enqueued 1000 messages to be worked. Each message results in anywhere from 100 to 5000 messages to be enqueued onto a second queue. The code for the 2nd queue will get a list of blob directory contents, and then enqueue up to 4 messages on a 3rd queue.

I am getting regular exceptions from the Health Monitor related to the Http Connnection Limit being exceeded. Other then that AppInsights is reporting that my "requests" are all succeeding.

269 of the 1000 messages never finished (for some reason I have yet to determine), and were continually retried until I stopped the function. 5 messages did get moved to the poison queue, but I have no idea what there dequeue count was prior to them finally being moved. Based on when they were inserted, I have to imagine that the dequeue count was > 3.

Investigative information

Please provide the following:

Timestamp: 6/19/2018, 15:11:03 UTC
Function App version: 2.0-beta
Function App name:
Function name(s) (as appropriate): Backup-LegacyMemberContainer-ProcessMember
Invocation ID: fd504dcc-39da-4ecd-a05d-d3a5c30ee97a
Region: North Central US
Consumption Base Plan: Yes

Repro steps

See above. I have not tried to distill this down into a self contained repro case yet.

Expected behavior

Queue messages are moved to the poison queue when there dequeue count exceeds the maxDequeueCount at all times.

Actual behavior

Queue messages are showing a dequeue count of 65-69 while maxDequeueCount is set to 3. This seems to occur only under high load.

Known workarounds

None that I know of at this time.

asalvo commented 6 years ago

After doing some more googling and re-reviewing what I was seeing with my function app, it seems like the problem is not with messages not being moved to the poison queue, but instead the messages are not properly dequeued/deleted.

https://github.com/Azure/azure-webjobs-sdk/issues/1040#issuecomment-394882511

fabiocav commented 6 years ago

@brettsam - can you comment?

brettsam commented 6 years ago

It sounds like there's a few issues here --

Too many network connections (are you still seeing this?)
Dequeue count is too high, possibly from failure to renew visibility timeout.

When you get too many threshold errors (and do not recover), we will restart your host. It's also possible that this happens frequently enough that you're never able to actually complete many of your functions. This would result in really high dequeue counts, but we'd never move anything to the poison queue (this only happens on failure).

How long are your functions typically taking to run? Have you figured out the error with your connections? Was the URL in the error helpful? It should lead you to: https://docs.microsoft.com/en-us/azure/azure-functions/manage-connections

asalvo commented 6 years ago

I had to keep the project moving forward, so I redeployed to a App Service Plan and my problems went away.

Average execution time as reported by App Insights

Function 1: 3.4 minutes
Function 2: 785ms

Function 1 processes queue 1 (this is the queue I posted the screen shot of) and reads data from a Sql Azure database, enqueuing 50-5000 messages to be processed by Function 2. Function 2 lists the blobs in the specified path, and then enqueues 2-4 messages into a 3rd queue. The 3rd queue is processed by a separate function app.

It does seem like the host was restarting and thus the messages were continuously retried. As to the URL in the error link it gave me a better idea of what might be happening, but I still have some questions.

I'm using a Queue Trigger on the functions, but then my own code to enqueue and work with blobs (as well as SQL, but since that's not HTTP, that shouldn't be a contributing factor). I cache the Queue Client and Blob Client objects in my code. So does this mean that I would have 3 potentially HTTP connections open per function app instance? And on a consumption based plan, where I have no control over the scale-out, I could end up with 200 instances of my function app running which would be 600 potential http connections?

brettsam commented 6 years ago

I'm glad you were able to move forward. That does point toward the connection threshold being your underlying issue, then.

To answer one of your questions -- the connection limit is per-sandbox (which really means per-instance). So if you had 200 instances, each of them would have a 300 connection limit. So the scale out shouldn't directly factor into the threshold.

To try to answer some more questions :-) --

Another setting (this one from .NET), the ServicePointManager.DefaultConnectionLimit comes into play as well. I wrote up some scenarios a while back in an issue here: https://github.com/Azure/azure-functions-host/issues/2085#issuecomment-341530681. That may give you a little more detail.

We've since set the ServicePointManager.DefaultConnectionLimit to 50 in the Consumption Plan.

So, it's possible that if your site is sending requests to 5+ different client/host combinations, you're maxing out this 50 limit on all of them and hitting the limit. And unfortunately we don't have a way for you to change the DefaultConnectionLimit today.

There may be a way to get things smoother if you're working in the Consumption Plan, though. Some questions:

If you use an Queue output binding for writing out queue messages (rather than your own client), you'll be re-using our internal client, which will remove a client usage.
If you can't do that -- is this a C# app? If so, are you using the same storage account for all of your queues and blobs? If yes to both, you can actually add a new binding to your app and request a CloudQueueClient directly. That will re-use the same one that we use internally and get rid of at least one of your client usages.
For your client caching -- are you caching app-wide? I.e. -- do Function1 and Function2 use the same CloudQueueClient?

I can give you more detail on any of these scenarios if you want to try one out.

asalvo commented 6 years ago

Question 3: Yes, the caching is client wide using static collections.

Question 1/2: Yes, this is a c# app. I was thinking that if we used a Queue output binding that it would help cut down on the number of clients, but was unaware of the CloudQueueClient binding. While that might help in this specific case, it seems like I would be just setting myself up for failure later on. As soon as we need to add support for an HTTP client not managed by Azure Functions, we'd be back to where we started (i.e. connecting to any number of 3rd party services). And while this set of functions do work with a single storage account, we have a couple of use cases where we do work with multiple storage accounts (these are implemented via WebJobs currently). Also, the more bindings we add, the harder it is to unit test and setup IoC.

Coming from WebJobs, and even Azure Functions on an App Service Plan, consumption based functions have a lot of limitations to deal with. I would steer clear of consumption based functions completely, but the allure of potentially cheaper compute costs always has management asking about them.

awithy commented 4 years ago

This thread has been quiet for a while, and I can confirm I'm experiencing similar behavior, and it seems to be when the function (or function app) is under CPU load (UPDATE: Not sure CPU load has anything to do with it).

Scenario:

Consumption-based plan
Enqueue 500 messages to storage queue trigger
Perform a few seconds of CPU-bound operation on each message
Queue batch size of 1

Observed behavior:

Some messages are dequeued but never hit my function handler (I am almost positive of this). This seems to occur in the ballpark of 0.5-1% of messages.
No exceptions or failures in either the functions tracing or my application's
Once the visibility timeout expires (I can't seem to be able to change this value BTW! - seems to be hard-coded in the SDK) the message will reappear and successfully process

My current hunch is that the CPU operations are starving the host daemon processes and causing the host to be restarted? (UPDATE: Not sure it's CPU starvation - I tried various things to run at less than 100% CPU and this problem persists)

Clearly this is less than ideal behavior and am happy to try and provide more info.

brettsam commented 4 years ago

@awithy -- If you could share your Function App name (either explicitly or privately) along with a timeframe when you see this, I can see if I find anything from the backend logs.

awithy commented 4 years ago

@brettsam sure thing: gicalcapifadev11, gicalcapifadev12, gicalcapifadev13

This issue was hitting me over most of the last two days. A good period would be between 0900 and 1300 PT yesterday (11/7), where I saw a lot of lost messages.

I can almost certainly reproduce this, but don't have time to put together a sample app. Thanks for digging into this.

brettsam commented 4 years ago

I just got around to taking a look at this. I do see a handful of errors with a stack like:

System.NullReferenceException : Object reference not set to an instance of an object.
   at async gicalcfunc.FunctionExecutor.Run(String messageBody,Nullable`1 queueWaitTime,Int32 dequeueCount) at C:\projects\prometheus\gi-calc-services\src\gicalcshared\Functions\FunctionExecutor.cs : 82
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async gicalcfunc.FunctionExecutor.Run(String messageBody,Nullable`1 queueWaitTime,Int32 dequeueCount) at C:\projects\prometheus\gi-calc-services\src\gicalcshared\Functions\FunctionExecutor.cs : 137
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async gicalcfunc.FunctionsCommon.Run(String functionName,CloudQueueMessage message) at C:\projects\prometheus\gi-calc-services\src\gicalcfunc\FunctionsCommon.cs : 32
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async gicalcfunc.gicalctask.Run(CloudQueueMessage message,ILogger log) at C:\projects\prometheus\gi-calc-services\src\gicalcfunc\gicalctask.cs : 16
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync[TReflected,TReturnType](TReflected instance,Object[] arguments) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs : 20
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync[TReflected,TReturnValue](Object instance,Object[] arguments) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs : 52
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeAsync(IFunctionInvoker invoker,ParameterHelper parameterHelper,CancellationTokenSource timeoutTokenSource,CancellationTokenSource functionCancellationTokenSource,Boolean throwOnTimeout,TimeSpan timerInterval,IFunctionInstance instance) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 585
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance,ParameterHelper parameterHelper,ILogger logger,CancellationTokenSource functionCancellationTokenSource) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 532
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(??) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 468
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at async Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(??) at C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs : 278

Are these understood errors? It looks like they're happening in your FunctionCommon \ FunctionExecutor code.

awithy commented 4 years ago

Hi @brettsam, we were doing a ton of active develpoment, so I'm not surprised there are a bunch of exceptions. If an exception is thrown in our handler, we will get it in app insights.

When diagnosing this issue, I put a try catch around the entire function handler and had it synchronously log to blob storage (just in case there was something strange going on with our app insights implementation). Nothing.

Probably the bigger issue here (than the fact that messages are seemingly being dequeued and not given to our handler) is that we are unable to set the visibility timeout to anything but 10 minutes. If we could shorten this, we would be able to recover from these issues (10 minutes is way to long for our use case). This seems to be a limitation of the current SDK from my understanding of: https://github.com/Azure/azure-webjobs-sdk/blob/bd5891d622c06bbed5141beb7332c9b1b1ab6a93/src/Microsoft.Azure.WebJobs.Host/Queues/Listeners/QueueListener.cs#L78

brettsam commented 4 years ago

I see you're using 3.0.4 of Microsoft.Azure.WebJobs.Extensions.Storage -- we have a lot more logging and error handling in newer releases. Are you able to move forward? That will give us details (in our backend logs) about message dequeues, etc. We also have a deadlock that comes from the underlying Storage SDK that we have worked around temporarily.

If you're able to move forward and continue seeing this -- send me the timeframe and I should be able to get a lot more details about it. If you have specific message ids that are hitting this, that would be helpful as well.

awithy commented 4 years ago

Hi @brettsam This is no longer a blocking issue for us as we have worked out an alternate design. I should be able to try this out in early December and let you know how it goes. Feel free to reach out directly.

Azure / azure-functions-host