MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

How does clientRetryOptions work in relation to Retry Policies? #90225

Open MartinWickman opened 2 years ago

MartinWickman commented 2 years ago

I'm trying to understand how and when clientRetryOptions and maxAutoLockRenewalDuration are used. It's not clear from the docs.

What's confusing is how these relate to the retry policy attributes that you put on functions. It seems to me they conflict with each other, or am I just missing something crucial here?

It boils down to this:

  1. Are the retry policy attributes (such as [FixedDelayRetry]) related to the clientRetryOptions setting in host.json? Are they the same? Will one override the other or will they multiply on top of each other?
  2. How and when does the maxAutoLockRenewalDuration setting come into play? The default is 5 minutes, but the default lock duration on Service Bus is about 1 minute. Doesn't that mean the lease will expire in one minute and then after 5 minutes it will be renewed? What about when a retry policy is doing retries (possibly for hours)? I don't get it.
    [FixedDelayRetry(maxRetryCount: 10, delayInterval: "00:00:30")]
    public void MyFunction([ServiceBusTrigger("%QueueName%", Connection = "Connection")] string message)
    {
        // ...
    }

{
    "version": "2.0",
    "extensions": {
        "serviceBus": {
            "clientRetryOptions": {
                "mode": "exponential",
                "tryTimeout": "00:01:00",
                "delay": "00:00:00.80",
                "maxDelay": "00:01:00",
                "maxRetries": 3
            },
            "maxAutoLockRenewalDuration": "00:05:00"
        }
    }
}


mike-urnun-msft commented 2 years ago

Thank you for your feedback! We will review and update as appropriate.

mike-urnun-msft commented 2 years ago

Hello @MartinWickman - I answered your questions below:

Are the retry policy attributes (such as [FixedDelayRetry]) related to the clientRetryOptions setting in host.json? Are they the same? Will one override the other or will they multiply on top of each other?

Both retry policies are separate and layer on top of each other. As a result, the total number of retries multiplies. You may review this explanation, which discusses the combined effect of the runtime retry policy and the Service Bus retry policy.
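
To put rough numbers on that layering (values assumed for illustration, not taken from the docs): with a queue whose MaxDeliveryCount is 10 and a function decorated with [FixedDelayRetry(maxRetryCount: 5, delayInterval: "00:00:30")], each of the up to 10 deliveries can itself be retried up to 5 more times by the Functions runtime, so in the worst case a single message is attempted 10 × (5 + 1) = 60 times before Service Bus dead-letters it.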

How and when does the maxAutoLockRenewalDuration setting come into play? The default is 5 minutes, but the default lock duration on Service Bus is about 1 minute. Doesn't that mean the lease will expire in one minute and then after 5 minutes it will be renewed? What about when a retry policy is doing retries (possibly for hours)? I don't get it.

maxAutoLockRenewalDuration is set by the Service Bus consumer/client application, which in this case is the Azure Functions app, whereas Lock Duration is a setting on the Service Bus broker side. In other words, Lock Duration is what you specify on your queue or subscription for how long a message stays in locked mode (safely preventing other consumers from processing the same message and creating a race condition) while it is being processed by a consumer application. If there is a chance processing will need more time, the consumer application can set maxAutoLockRenewalDuration so that the lock keeps being renewed, up to that total duration.
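
For reference, a minimal sketch of the same knob on the underlying Azure.Messaging.ServiceBus SDK (the queue name and connection string are placeholders), just to show that it is a client-side processor option, which the Functions Service Bus extension sets for you from host.json:

    using System;
    using Azure.Messaging.ServiceBus;

    // Sketch only: the Functions extension configures its internal processor from
    // host.json; shown here on the raw SDK to make the ownership of the setting clear.
    var client = new ServiceBusClient("<connection-string>");   // placeholder
    var processor = client.CreateProcessor("<queue-name>", new ServiceBusProcessorOptions
    {
        // Client-side: keep renewing the broker's lock (LockDuration, e.g. 1 minute)
        // while the handler is still running, for at most this total duration.
        MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(5)
    });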

Since we didn't determine any changes to this doc upon reviewing your feedback, we will now proceed to close this thread. If there are further questions regarding this matter, please reopen it and we will gladly continue the discussion.

MartinWickman commented 2 years ago

Thanks @mike-urnun-msft for your response. I have one follow-up question:

Are the retry policy attributes (such as [FixedDelayRetry]) related to the clientRetryOptions setting in host.json? Are they the same? Will one override the other or will they multiply on top of each other?

Both retry policies are separate and layer on top of each other. As a result, the total number of retries multiplies. You may review this explanation

I do think we're talking about different things here. I am referring to the serviceBus/clientRetryOptions setting in host.json (see below). The resilient retries you are talking about are the ones defined on the Service Bus side itself, and those are definitely not what is configured here.

How is the serviceBus/clientRetryOptions setting related to the retry policy attributes (such as [FixedDelayRetry])? Clearly there is something I'm missing here (or I'm reading the docs wrong).

{
    "version": "2.0",
    "extensions": {
        "serviceBus": {
            "clientRetryOptions": {
                "mode": "exponential",
                "tryTimeout": "00:01:00",
                "delay": "00:00:00.80",
                "maxDelay": "00:01:00",
                "maxRetries": 3
            },
            "maxAutoLockRenewalDuration": "00:05:00"
        }
    }
}

MartinWickman commented 2 years ago

@mike-urnun-msft did you see my question above? I don't feel this issue is quite resolved yet.

AJMcKane commented 2 years ago

Can I bump this, please? I'm currently having to debug an issue with an Azure Function that's triggered from Azure Service Bus, but via the custom handler approach (so we don't have the Azure Functions attributes on our functions' "Run" method).

What we're seeing is the retry options not being obeyed and the docs are very unclear as to what maps to what.

ggailey777 commented 2 years ago

Retry policies going forward will only be supported for Timer and Event Hubs triggers. We've updated the docs for the retry policy GA here: https://docs.microsoft.com/azure/azure-functions/functions-bindings-error-pages#retries

ggailey777 commented 2 years ago

I should also point out those client retry options were introduced in v5.x of the extension. Are you using the latest version of the Service Bus extension?

AJMcKane commented 2 years ago

Ahh, it looks like we're on 2.0

"extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[1.*, 2.0.0)"
  },
  "functionTimeout": "01:00:00",
  "customHandler": {
    "description": {
      "defaultExecutablePath": "FunctionHandler",
      "workingDirectory": "",
      "arguments": []
    },
    "enableForwardingHttpRequest": true
  },
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "exponential",
        "tryTimeout": "00:05:00",
        "delay": "00:01:00",
        "maxDelay": "00:10:00",
        "maxRetries": 5
      },
      "messageHandlerOptions": {
        "maxConcurrentCalls": 3
      }
    }
  }

I'll action that with my team and see if it helps! Thanks :)

MartinWickman commented 2 years ago

Retry policies going forward will only be supported for Timer and Event Hubs triggers. We've updated the docs for the retry policy GA here: https://docs.microsoft.com/azure/azure-functions/functions-bindings-error-pages#retries

That's quite the surprise! I'm sure lots of people are using things like [ExponentialBackoffRetry] to handle retries, especially for Service Bus. Just to make it clear: Service Bus's native retry support is not even close to being the same thing, and to be frank, having Service Bus retry the same message 10 times as fast as possible and then dead-letter it does not really help anyone mitigate temporary errors. What is missing is "retry with delay", and that's what the policies are (were) used for.

So what would be a reasonable migration strategy for those people? I'm sure you have thought about that and simply just forgot to update the documentation.

AJMcKane commented 2 years ago

What I found strange coming from AWS SQS to Azure and Service Bus is that retries don't re-enter the queue. My expectation would be to put the message back on the queue (at the bottom) with a minimum retry delay.

The reason we've stumbled into this area is that with the current retry behaviour, if you have a large block of messages that will fail (say due to transiently corrupted data or a temporary API outage), your entire ingestion will block up as your X functions constantly keep retrying the same messages instead of cycling through them in order.

ilya-git commented 2 years ago

I am also very interested in seeing what the alternative options are for this, as there is, as far as I am aware, still no mechanism in Service Bus that allows for a delayed retry. I in fact had to use Durable Functions to achieve this goal, since even ExponentialBackoffRetry is not bulletproof, but it worked to some extent.

AJMcKane commented 2 years ago

I've updated the extension bundle version to [3.0.0, 4.0.0), which includes v5 of the Service Bus extension, and the retries are still triggering immediately. This is a custom handler with a serviceBus "in" binding.

From what the docs say the clientRetryOptions should work in this use case?

tufberg commented 1 year ago

I also want to bump this since we're also seeing the ClientRetryOptions not being obeyed and the docs are very unclear.

cmclellen commented 1 year ago

Same here...have been struggling with this. ClientRetryOptions not being obeyed.

pferrot commented 1 year ago

+1

AnjaEndresen01 commented 1 year ago

I am also very interested in how to get a retry delay to work; I have been testing with the host.json file without the expected result.

AJMcKane commented 1 year ago

We've had another occurrence of our retry options not working. Does anyone know of the recommended work-around here?

Sakkie commented 1 year ago

Also struggling with this issue. Even after setting ClientRetryOptions in Startup.cs.

andrewdmoreno commented 1 year ago

Retry policies going forward will only be supported for Timer and Event Hubs triggers. We've updated the docs for the retry policy GA here: https://docs.microsoft.com/azure/azure-functions/functions-bindings-error-pages#retries

That's quite the surprise! I'm sure lots of people are using things like [ExponentialBackoffRetry] to handle retries, especially for Service Bus. Just to make it clear: Service Bus's native retry support is not even close to being the same thing, and to be frank, having Service Bus retry the same message 10 times as fast as possible and then dead-letter it does not really help anyone mitigate temporary errors. What is missing is "retry with delay", and that's what the policies are (were) used for.

So what would be a reasonable migration strategy for those people? I'm sure you have thought about that and simply just forgot to update the documentation.

It would seem like many people are struggling to find definitive guidance for this (myself included). Seeing as functionality that existed in preview to support a pattern of delayed Service Bus retries was removed at GA, you would think the team would at least offer a recommendation for achieving the same result (delayed retries).

The question was also asked in https://github.com/Azure/azure-functions-dotnet-worker/issues/955 and went unanswered and that conversation is locked. @mike-urnun-msft I think it would help a lot of folks in this issue and in the one referenced to at least have a recommendation or official input from the team on how to best accomplish this.

Extravisio commented 1 year ago

OK, I will get back to you next week, buddy, on how this works.


mike-urnun-msft commented 1 year ago

@andrewdmoreno @Extravisio @Sakkie and the rest - My apologies for the long silence. I'll revisit the use cases here, raise this issue internally for further clarity, and share my findings with you all here.

sanjastojkova commented 1 year ago

+1, struggling with this issue, hence looking forward to hearing about the alternative options for a delayed retry policy.

NagaMellempudi commented 1 year ago

+1. According to the documentation, clientRetryOptions works for transient errors. How do we evaluate/test these scenarios? We are having a hard time implementing retry for Service Bus-triggered functions.

nour95 commented 1 year ago

+1

sdg002 commented 1 year ago

I think the OP has raised some valid questions and concerns here.

While I am a big fan of Azure Functions for the simplicity it provides, the documentation regarding retries needs better explanation and scenario-specific elaboration.

I am reading the article titled Azure Service Bus bindings for Azure Functions

        "serviceBus": {
            "clientRetryOptions":{
                "mode": "exponential",
                "tryTimeout": "00:01:00",
                "delay": "00:00:00.80",
                "maxDelay": "00:01:00",
                "maxRetries": 3
            }

[Screenshot of the docs caveat on clientRetryOptions: "They don't affect retries of function executions."]

What am I to understand from the caveated guidance "They don't affect retries of function executions"? Does exponential backoff work for Service Bus or not? I need to handle transient errors.

Thanks.

sdg002 commented 1 year ago

Hello Team, can somebody from Microsoft please confirm whether the exponential back-off setting under the clientRetryOptions element of host.json works for Python Azure Functions?

I am using Azure Functions Tools 4.0.4 and Python version 3.9.7.

Thanks, Sau

sdg002 commented 1 year ago

Ahh, it looks like we're on 2.0

"extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[1.*, 2.0.0)"
  },
  "functionTimeout": "01:00:00",
  "customHandler": {
    "description": {
      "defaultExecutablePath": "FunctionHandler",
      "workingDirectory": "",
      "arguments": []
    },
    "enableForwardingHttpRequest": true
  },
  "extensions": {
    "serviceBus": {
      "clientRetryOptions": {
        "mode": "exponential",
        "tryTimeout": "00:05:00",
        "delay": "00:01:00",
        "maxDelay": "00:10:00",
        "maxRetries": 5
      },
      "messageHandlerOptions": {
        "maxConcurrentCalls": 3
      }
    }
  }

I'll action that with my team and see if it helps! Thanks :)

Hello @AJMcKane, @ggailey777, please could one of you guide me on how to install and reference version 5 of the Service Bus extension?

Thanks.

AJMcKane commented 1 year ago

@sdg002 updating the version of your Microsoft.Azure.Functions.ExtensionBundle reference to the latest does this.
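
For anyone else landing here, a minimal host.json sketch (the exact version range shown is one example; per the comments above, bundle 3.x and later ship v5 of the Service Bus extension):

    {
        "version": "2.0",
        "extensionBundle": {
            "id": "Microsoft.Azure.Functions.ExtensionBundle",
            "version": "[4.*, 5.0.0)"
        }
    }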

@mike-urnun-msft do we have any update or ETA on a solution / alternate option for this?

vadymal commented 1 year ago

@mike-urnun-msft, any updates on it?

tomkuijsten commented 1 year ago

As far as I know, there is still no solution for a real exponential backoff. I ended up building a NuGet package myself, and hacked some reflection in there to create a MessageAction+ binding for the Service Bus function trigger. This can be used to back off a message; it works by completing the current message and creating a new postponed message. That only works for a queue, though, not a topic (as you cannot create a message on a topic for just one subscription).

wouter-b commented 1 year ago

@tomkuijsten since we're struggling with the same I would be very interested in your nuget package, is it public?

tomkuijsten commented 1 year ago

@wouter-b I created a custom binding and a kind of extended MessageActions object, see following gists:

https://gist.github.com/tomkuijsten/aa3374c09ec0b6db4a542aa8ef343716

https://gist.github.com/tomkuijsten/87d777a0896972cc068cc0e5d4d688f2

Note: Please, don't just use this in production, it's using reflection! whaaa!

wouter-b commented 1 year ago

@wouter-b I created a custom binding and a kind of extended MessageActions object, see following gists:

https://gist.github.com/tomkuijsten/aa3374c09ec0b6db4a542aa8ef343716

https://gist.github.com/tomkuijsten/87d777a0896972cc068cc0e5d4d688f2

Note: Please, don't just use this in production, it's using reflection! whaaa!

@tomkuijsten thanks 👍 Really appreciate it!

Tcsekhar74 commented 9 months ago

Any update on this issue yet?

david-may-shift commented 9 months ago

Also interested in the response to this thread, as the documentation is not clear and it is difficult to understand how it all works. Thanks.

otaviobertucini commented 7 months ago

It is 2024 already, and no clarification or alternative has been given by the Azure Functions team.

I'm facing the same problem as described here:

What I found strange coming from AWS SQS to Azure and Service Bus is that retries don't re-enter the queue. My expectation would be to put the message back on the queue (at the bottom) with a minimum retry delay.

The reason we've stumbled into this area is that with the current retry behaviour, if you have a large block of messages that will fail (say due to transiently corrupted data or a temporary API outage), your entire ingestion will block up as your X functions constantly keep retrying the same messages instead of cycling through them in order.

@mike-urnun-msft is there any news?

My team and I are strongly inclined to move to SQS/Lambda if nothing changes.

ajaykrishnan33 commented 7 months ago

I am facing the same issue, @mike-urnun-msft, @shreyabatra4, and would appreciate it if this could be addressed.

My main use case for Azure functions is for responding to service bus messages in topics, and the absence of a robust retry mechanism for this would be a complete dealbreaker and will force me to look at other options for this.

ajaykrishnan33 commented 7 months ago

After thinking a little bit about this, I don't think there is any good generic solution that works in all cases without invasive changes to how Azure Service Bus works.

Retrying with any sort of delay would require the Azure functions runtime to take a lock on the message in the queue/subscription, but if the total delay exceeds the max lock duration, then the lock would have to be released and the message would continue to be in the queue/subscription.

If using a queue, the consumer could re-enqueue the message on failure with a delay, but in the case of a topic that is not possible, since the message would end up being replicated to all subscriptions of the topic, which is undesirable. Fixing this would likely require changes to the behaviour of Azure Service Bus, which means it is unlikely to be addressed anytime soon by Microsoft, especially in a thread related to Azure Functions.

One solution I can think of off the top of my head is to have the function triggered by a queue instead of a topic subscription. Topic subscriptions can be created with forwarding enabled to a particular queue. Then the function can re-enqueue the message with a delay in case of error, and the other subscriptions will not be affected. A rough sketch of that idea follows below.
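
Here is a rough sketch of the re-enqueue-with-delay idea for a queue-triggered function (in-process model with Service Bus extension v5; the QueueName/Connection settings, the fixed 5-minute delay, and manual settlement via autoCompleteMessages=false in host.json are all assumptions, not official guidance):

    using System;
    using System.Threading.Tasks;
    using Azure.Messaging.ServiceBus;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.ServiceBus;

    public static class DelayedRetryFunction
    {
        [FunctionName("DelayedRetryFunction")]
        public static async Task Run(
            [ServiceBusTrigger("%QueueName%", Connection = "Connection")] ServiceBusReceivedMessage message,
            ServiceBusMessageActions messageActions)
        {
            try
            {
                // Real processing goes here.
                await ProcessAsync(message);
                await messageActions.CompleteMessageAsync(message);
            }
            catch (Exception)
            {
                // Schedule a copy of the message a few minutes in the future, then
                // complete the original so the broker does not redeliver it immediately.
                // (In real code, cache or inject the client instead of creating it per call,
                // and track an attempt count, e.g. in ApplicationProperties, to cap retries.)
                await using var client = new ServiceBusClient(Environment.GetEnvironmentVariable("Connection"));
                ServiceBusSender sender = client.CreateSender(Environment.GetEnvironmentVariable("QueueName"));

                await sender.ScheduleMessageAsync(new ServiceBusMessage(message.Body), DateTimeOffset.UtcNow.AddMinutes(5));
                await messageActions.CompleteMessageAsync(message);
            }
        }

        private static Task ProcessAsync(ServiceBusReceivedMessage message) => Task.CompletedTask;
    }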

EDIT: Looks like I came up with the same solution that @tomkuijsten came up with last year.

@mike-urnun-msft , @shreyabatra4 If my analysis is correct, then please do update the docs for the service bus trigger so that this confusion doesn't arise for more people.