Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License

ServiceBusTrigger delayed for extended period of time before executing #6786

Open timgabrhel opened 3 years ago

timgabrhel commented 3 years ago

Check for a solution in the Azure portal

Nothing found.

Investigative information

This is a duplicated bug report of https://github.com/Azure/azure-functions-servicebus-extension/issues/116

The issue I am having is that my function with a ServiceBusTrigger goes dark for an extended period of time. In the linked bug report, there is a 12 hour gap of silence, then it caught up with all messages. Then, after a short period of normal behavior, it went silent again. I was able to force it to trigger by initiating a new deployment via Azure DevOps.

Please provide the following:

The information provided here does not refer to a specific problem invocation. I'm hoping you can use this information to trace into the "dark" periods.

The periods of silence were (all times Central):

- 10/11 8:30am - 5:30pm
- 10/11 8:30pm - 5:30am (next day)
- 10/13 12:30pm - 2:15pm

Expected behavior

The ServiceBusTrigger should be firing automatically when new messages are available in the queue.

Actual behavior

The ServiceBusTrigger works for a period of time, then stops triggering automatically. The first noticed period was 12 hours of silence, after which it fired on its own and caught up with all messages in the queue. Soon after, the message count rose again and the functions didn't execute. There were no Application Insights logs to correlate with this rising queue count. At the time of this writing, I just had the issue happen again, where the function didn't execute for around an hour. Triggering a new deployment via Azure DevOps caused the function to fire.

Known workarounds

Triggering a new deployment via AzureDevOps caused the function to fire.

Related information

Source

```csharp
[FunctionName("NotificationProcessor")]
public async Task Run(
    [ServiceBusTrigger(ServiceBusNotificationsQueueName, Connection = ServiceBusConnectionName)] Message message,
    MessageReceiver messageReceiver,
    ILogger logger)
{
    try
    {
        // read the message & process. based on the results, call abandon or complete.
        // DoWork();
        if (success)
        {
            await messageReceiver.AbandonAsync(message.SystemProperties.LockToken, modifiedProperties);
        }
        else
        {
            await messageReceiver.CompleteAsync(message.SystemProperties.LockToken);
        }
    }
    catch (Exception ex)
    {
        await messageReceiver.AbandonAsync(message.SystemProperties.LockToken);
        logger.LogError(ex, $"NotificationProcessor Error processing {message.MessageId}");
        throw;
    }
}
```
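For comparison (not from the original report), here is a minimal sketch of the usual settlement pattern with this extension: complete on success so the message is removed, abandon on failure so it is redelivered. The snippet above appears to swap the two branches relative to its own comment, possibly a paste artifact. The queue name, connection name, and `DoWorkAsync` helper below are placeholders, and manual settlement like this generally assumes `autoComplete` is set to `false` in host.json.

```csharp
// Namespaces assumed: System, System.Threading.Tasks, Microsoft.Azure.WebJobs,
// Microsoft.Azure.ServiceBus, Microsoft.Azure.ServiceBus.Core, Microsoft.Extensions.Logging.

[FunctionName("NotificationProcessorSketch")]
public static async Task RunSketch(
    [ServiceBusTrigger("notifications", Connection = "ServiceBusConnection")] Message message,
    MessageReceiver messageReceiver,
    ILogger logger)
{
    try
    {
        bool success = await DoWorkAsync(message); // hypothetical processing step

        if (success)
        {
            // Success: settle the message so it is removed from the queue.
            await messageReceiver.CompleteAsync(message.SystemProperties.LockToken);
        }
        else
        {
            // Recoverable failure: release the lock so the message is redelivered.
            await messageReceiver.AbandonAsync(message.SystemProperties.LockToken);
        }
    }
    catch (Exception ex)
    {
        logger.LogError(ex, $"Error processing {message.MessageId}");
        await messageReceiver.AbandonAsync(message.SystemProperties.LockToken);
        throw;
    }
}

// Hypothetical placeholder for the actual processing logic.
private static Task<bool> DoWorkAsync(Message message) => Task.FromResult(true);
```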
timgabrhel commented 3 years ago

This issue was created at the request made in this SO post: https://stackoverflow.com/questions/64318870/servicebustrigger-delayed-in-executions

jesuissur commented 3 years ago

We have the same problem here. Messages were enqueued for about 5 hours before being processed by our Service Bus-triggered function. It is a sporadic problem, but pretty critical for a production environment.

Our function app is "Always on" per the Azure configuration settings. It is hosted in a premium/production Windows App Service plan (P2v2). We are on SDK version 3.0.9.

What can we do to help find and fix the problem, or establish a solid workaround?

Thanks Phil

timgabrhel commented 3 years ago

@jesuissur Hey Phil, I ended up working with technical support on this and never found a specific issue I can point to. We did identify some instability in my function app: there were a couple of invalid CRON configurations on some timer-triggered functions, as well as some unhandled exceptions in other functions, which we were guessing may have destabilized the function host and caused a shutdown.

I see I've had 1 delay in the last 30 days so it's definitely not cleared up yet.
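For anyone checking their own timer schedules against the "invalid CRON configs" mentioned above: TimerTrigger expects a six-field NCRONTAB expression. A minimal valid sketch (the function name and schedule here are illustrative, not from this app):

```csharp
// Namespaces assumed: System, Microsoft.Azure.WebJobs, Microsoft.Extensions.Logging.
// "0 */5 * * * *" = second 0 of every 5th minute (six fields: sec min hour day month day-of-week).
[FunctionName("HeartbeatTimer")] // illustrative name, not from this thread
public static void RunHeartbeat([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger logger)
{
    logger.LogInformation($"Heartbeat at {DateTime.UtcNow:O}; past due: {timer.IsPastDue}");
}
```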

soninaren commented 3 years ago

@timgabrhel is the app configured to use runtime scaling?

timgabrhel commented 3 years ago

@soninaren Can you point me to a resource where I can check? I'm having a hard time researching this well enough to give you a confident answer.

soninaren commented 3 years ago

@timgabrhel To check the configuration for runtime scaling, open the Azure portal -> <Your function app> -> Configuration -> Function runtime settings (see attached screenshot).

timgabrhel commented 3 years ago

I'm on a Consumption plan and I'm not seeing the option (see attached screenshot).

amyhu-ms commented 3 years ago

Is there any update on this issue? We saw similar behavior last week, where it took several hours from messages being enqueued to them being processed by our Service Bus-triggered function. We are using Listen-only access to the Service Bus due to business needs. Please share any thoughts on whether this may or may not cause hours of delay in messages being picked up.

julia-xu commented 3 years ago

Also seeking updates on this issue

jesuissur commented 3 years ago

We hit the problem in our environment once again. The symptoms are a bit different though. Our function stopped processing messages (we have more than 10K in the triggered queue) and refuses to work even after many restarts and deployments. The function was fine until then. It has now been more than 24 hours without any processing, except for a few messages (50 at most) that went through the function during one short period (~15 minutes).

When we restart the function, the traces indicate several executions triggered by the Service Bus, but nothing ever completes (no "executed" trace message).

Some facts that might help:

  1. We scale the function up to 5 instances every night and back down to 1 during the day.
  2. From the 10K messages that are stuck, several show a weird lock-time pattern. One message was delivered 3-4 times and was enqueued around 06:52 on Dec. 15th, but it is locked until 05:09 on Dec. 16th: almost a full day between the enqueue time and the latest lock expiration, after 3-4 retries. The lock duration on this queue is the default (1 minute). Here's an example of a message we peeked from the queue around 10:30 UTC on Dec. 16th:
     {
       "deliveryCount": 4,
       "enqueuedSequenceNumber": 0,
       "enqueuedTimeUtc": "2020-12-15T06:52:33.481241Z",
       "expiresAtUtc": "9999-12-31T23:59:59.9999999",
       "forcePersistence": false,
       "isBodyConsumed": false,
       "lockedUntilUtc": "2020-12-16T05:09:53.6549114Z",
       "lockToken": "4e852f23-b3f9-480b-b23d-97babbfc5e64",
       "messageId": "213dafa8127d4867bad1d357e5d592ef",
       "partitionKey": null,
  3. Our function trigger is configured with the following settings, and we run in our own (premium) App Service plan; see the host.json sketch after this list: "functionTimeout": "02:00:00", "extensions": { "serviceBus": { "prefetchCount": 0, "messageHandlerOptions": { "autoComplete": true, "maxConcurrentCalls": 80, "maxAutoRenewDuration": "02:00:00" } } },
  4. When we browse to the function URL (e.g. https://our-function-1.azurewebsites.net/) it does not respond (500 - Timeout). But we have another function (same code but "watching" another queue) on the same App Service plan, and we can browse to its URL and see the Azure Functions landing page.
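For readability, here is how the settings from item 3 would sit in a complete host.json, assuming the Functions v2/v3 host schema ("version": "2.0") and the Microsoft.Azure.WebJobs.Extensions.ServiceBus 4.x package mentioned below; the values are copied from the item above, only the version wrapper is added:

```json
{
  "version": "2.0",
  "functionTimeout": "02:00:00",
  "extensions": {
    "serviceBus": {
      "prefetchCount": 0,
      "messageHandlerOptions": {
        "autoComplete": true,
        "maxConcurrentCalls": 80,
        "maxAutoRenewDuration": "02:00:00"
      }
    }
  }
}
```

One observation, not a confirmed cause: with autoComplete set to true the host settles each message itself after the invocation returns, so a function that also settles manually (as in the snippet at the top of this issue) could conflict with that.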

We will try upgrading Microsoft.Azure.WebJobs.Extensions.ServiceBus from 4.1.2 to 4.2.0.

Hoping this helps with this major problem. Phil

jesuissur commented 3 years ago

We still get the same problem with versions 4.2.0 and 4.2.1 of Microsoft.Azure.WebJobs.Extensions.ServiceBus. Any updates on this one?

v-bbalaiagar commented 3 years ago

Hi @soninaren, can you kindly look into this?

Ubloobok commented 1 year ago

I think we just had a similar issue: a message sat in the Service Bus queue for 55 minutes (a Blob Created event from Blob storage routed through Event Grid) before the Azure Function with the ServiceBusTrigger consumed it. There were no other logs or activity. It has only happened once so far, so we don't have many details.

francoishill commented 1 year ago

@timgabrhel, this is a known limitation with the ServiceBusTrigger when using the Consumption plan together with only the "Listen" policy (instead of the "Manage" policy). Refer to these docs, but it comes down to the function scaler not being able to determine the queue length and then shutting down the whole app (when using the Consumption plan).

I had the same issue and got it fixed by using the "Manage" policy for the connection string and removing the ;EntityPath=*** from the connection string.
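To make the difference concrete (the namespace, policy names, and keys below are placeholders, not values from this thread), the two connection-string shapes look roughly like this in local.settings.json; the fix described above amounts to moving from the first form to the second:

```json
{
  "IsEncrypted": false,
  "Values": {
    "ServiceBusConnection_ListenOnlyEntityScoped": "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=ListenOnlyPolicy;SharedAccessKey=<key>;EntityPath=<queue-name>",
    "ServiceBusConnection_ManageNamespaceScoped": "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=ManagePolicy;SharedAccessKey=<key>"
  }
}
```

On the Consumption plan the scale controller uses this connection to read the queue length; per the comment above, a Listen-only, entity-scoped string can leave it unable to do so, and the app stays scaled down.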

For more details, I followed these steps:

Other useful links:

andrewatfornax commented 1 year ago

@francoishill, I can confirm that we have this issue with the Consumption plan even when using a ServiceBus connection string that has the "Manage" policy, so it's not just an issue with using the "Listen" policy - there are other factors at play.

mrobertsonavidxchange commented 1 year ago

I am seeing a similar problem. Sometimes, my service bus triggered function app will process messages as soon as the topic receives the message. But, in most cases, the message sits on the subscription for 30 minutes or more before the function is triggered.

(Screenshots attached: ConfigurationSettings, ConfigurationSettings2)

I've attached the function app settings, as well as an Application Insights trace showing the message being enqueued at 2:41 EST and the Service Bus trigger not firing for that message until 3:15 EST.

(Screenshot attached: NotProcessedQuickly)

andrewatfornax commented 1 year ago

Hi @mrobertsonavidxchange,

In a separate MS support ticket, I am being advised that the issue relates to the Function App service not ensuring that triggers are working because we've exceeded an undocumented maximum number of functions permitted per Function App resource.

Would you mind letting me know how many functions you have in your Function App, please? I'm curious to see if you also exceed this undocumented limit...

mrobertsonavidxchange commented 1 year ago

Just 2! A trigger and a health check @andrewatfornax

andrewatfornax commented 1 year ago

Okay, so, assuming what we're being told is correct, there are at least a couple of possible causes of this issue...

cygnim commented 1 year ago

@andrewatfornax, we are also experiencing this issue and when we turned on verbose logging, we did see a log entry that states the number of triggers that are monitored when a function shuts down is 100. This limitation, as far as we can tell, is not documented anywhere. I created a ticket to address this in the documentation here: https://github.com/Azure/Azure-Functions/issues/2372.

andrewatfornax commented 1 year ago

Thanks @cygnim - in my discussions with Azure support about this, we don't believe that we're hitting any kind of undocumented limit on the number of triggers that causes the functions to shut down - we do see a message that LOOKS like that's what's going on, but we've been assured this is not actually the cause of the issue.

It actually appears to be a kernel-level issue in the function app service.

While MS is still working on a solution, we've been advised that the Function App team is working on improvements to service monitoring that will restart the connection to the Service Bus in the event that the connection drops out and processing of messages halts.