ilya-scale opened this issue 1 year ago
Hi @ilya-scale. The processor follows the configured retry policy for all operations, including attempts to connect. The retry policy can be configured via the ServiceBusClientOptions
used to create your client, as demonstrated here.
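As a sketch of that configuration, the policy can be tuned through `ServiceBusClientOptions.RetryOptions` (the namespace placeholder is illustrative; the values shown are the documented defaults):

```csharp
using System;
using Azure.Identity;
using Azure.Messaging.ServiceBus;

var options = new ServiceBusClientOptions
{
    RetryOptions = new ServiceBusRetryOptions
    {
        Mode = ServiceBusRetryMode.Exponential, // exponential back-off between tries (default)
        MaxRetries = 3,                         // retries per operation (default: 3)
        Delay = TimeSpan.FromSeconds(0.8),      // base back-off delay (default: 0.8s)
        MaxDelay = TimeSpan.FromSeconds(60),    // cap on the back-off (default: 60s)
        TryTimeout = TimeSpan.FromSeconds(60)   // per-try timeout (default: 60s)
    }
};

var client = new ServiceBusClient(
    "<fully-qualified-namespace>",
    new DefaultAzureCredential(),
    options);
```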
The processor is highly resilient and will attempt to recover from exceptions indefinitely. Since the processor lacks insight into your application and host environment, each exception is considered individually. Since you're configured for 5 concurrent calls, it's likely those retries will overlap and show a consistent pattern of activity. This is by design: we intentionally favor recovering rapidly and maximizing throughput.
I'm going to mark this as addressed, but feel free to unresolve if you'd like to continue the discussion.
Hi @ilya-scale. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.
I think part of the issue here is that when an error is not considered retriable, there is no delay between failures. Even for retriable errors, there would be no delay between the final attempt and the first attempt of the next cycle.
As a workaround, you can add a delay in your error handler for UnauthorizedAccessExceptions. Because the processor will await the error handler, this will prevent the scenario where the processor does not wait between attempts.
internal static async Task OnError(ProcessErrorEventArgs args)
{
    // The processor awaits this handler, so delaying here throttles the
    // otherwise immediate restart after an authentication failure.
    if (args.Exception is UnauthorizedAccessException)
    {
        await Task.Delay(5000);
    }
}
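For completeness, a minimal sketch of wiring that handler into a processor (the namespace and queue name are placeholders, and the message handler is illustrative):

```csharp
using Azure.Identity;
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient(
    "<fully-qualified-namespace>",
    new DefaultAzureCredential());

await using ServiceBusProcessor processor = client.CreateProcessor(
    "<queue-name>",
    new ServiceBusProcessorOptions { MaxConcurrentCalls = 5 });

processor.ProcessMessageAsync += async args => await args.CompleteMessageAsync(args.Message);
processor.ProcessErrorAsync += OnError;  // the delaying handler shown above

await processor.StartProcessingAsync();
```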
I do think it is worth considering adding some handling in the library itself for this, as I have heard of other users hitting this issue when they have misconfigured something.
/unresolve
@jsquire I believe there is something not quite correct in your statement. The link you provided shows that ServiceBusClient can be configured with retry options. It also mentions the defaults (which match what I found in the library code): a 1 minute try timeout and a maximum of 3 retries.
So I believe this timeout has no effect on the connection attempts, or maybe I am misunderstanding something here?
P.S. I do see over 100 retry attempts every second without any delay
@JoshLove-msft What you say makes sense. I am not so worried about this particular exception, since it generally happens during initial setup and should not persist (specifically this UnauthorizedAccessException). I was more worried about what happens if there is some Service Bus downtime or a network issue: will there be infinite retries without delay as well?
Another question: does it make sense not to retry on all of the listeners at the same time, since they will all hit the same error? That is, all of the listener threads except one could stop retrying until that one listener connects.
@jsquire I believe there is something not quite correct in your statement. The link you provided shows that ServiceBusClient can be configured with retry options. It also mentions the defaults (which match what I found in the library code): a 1 minute try timeout and a maximum of 3 retries.
The retry policy applies to an individual operation. For example: "Let's connect to Service Bus". That one operation is allowed 3 retries in the case of transient failures, with each try taking up to 60 seconds and then an exponential back-off between them.
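As a rough illustration of that shape (the real policy adds jitter, so this is only the doubling pattern, capped at `MaxDelay` — not the SDK's exact formula):

```csharp
using System;

// Sketch of the exponential back-off described above, without jitter.
static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
{
    double seconds = baseDelay.TotalSeconds * Math.Pow(2, attempt - 1);
    return TimeSpan.FromSeconds(Math.Min(seconds, maxDelay.TotalSeconds));
}

// With the defaults (0.8s base delay, 60s cap, 3 retries), the waits
// between tries are roughly 0.8s, 1.6s, and 3.2s.
for (int attempt = 1; attempt <= 3; attempt++)
{
    Console.WriteLine($"retry {attempt}: wait {BackoffDelay(attempt, TimeSpan.FromSeconds(0.8), TimeSpan.FromSeconds(60)).TotalSeconds}s");
}
```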
However, the processor itself is built to be resilient. If an attempt at an operation fails, the processor will keep attempting to make forward progress. The task for that failed "Let's read a message" is faulted and the processor will discard it and start another. This is where you'll see the potential for overlap between concurrent attempts and, as Josh pointed out (and I forgot to mention), where you'll see a non-transient error cycle much more quickly.
I think the disconnect in our conversation is the difference between "one operation is retried and then fails" versus "the processor will continue to try and make forward progress by continuing to spawn new operations when they fail until it is stopped."
Another question: does it make sense not to retry on all of the listeners at the same time, since they will all hit the same error?
That, unfortunately, is an incredibly difficult thing to judge in a generalized manner. A terminal error may not be a permanent one. Since the processor lacks insight into your application and host environment, it relies on your application to react to error patterns within the error handler and make the decisions appropriate for it. For example, stopping processing if errors are occurring at a rapid pace and no messages are being processed.
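As a sketch of that kind of application-side reaction, assuming a `processor` created as in the earlier snippets (the threshold and decay window are illustrative, and the stop is offloaded so the handler the processor is awaiting doesn't block on `StopProcessingAsync`):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Hypothetical guard: if errors arrive faster than they decay, stop the
// processor instead of letting it spin against a persistent failure.
int recentErrors = 0;

processor.ProcessErrorAsync += args =>
{
    if (Interlocked.Increment(ref recentErrors) > 50)
    {
        // Too many failures in a short window: back off by stopping the
        // processor. The application decides when (or whether) to restart it.
        _ = Task.Run(() => processor.StopProcessingAsync());
    }
    else
    {
        // Let each error "expire" after 10 seconds so only rapid bursts
        // of failures trip the guard, not occasional transient noise.
        _ = Task.Delay(TimeSpan.FromSeconds(10))
                .ContinueWith(_ => Interlocked.Decrement(ref recentErrors));
    }

    return Task.CompletedTask;
};
```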
Yes, it seems I have misunderstood it somewhat. So what is happening now is this if we follow what you wrote:
- One listener tries to connect and fails
- It waits 0.8 seconds (default retry delay)
- It does that 3 times, and then the process starts again at step 1 without any delay
That's generally correct for a transient error, at a high level. (Timing on the back-off isn't linear and there's some overhead in managing the tasks). This happens per degree of concurrency, since each is an independent unit.
If my understanding is correct that gives us under 10 retries per second for 5 listeners, while in reality I see that this is not the case and I have around 100 (actually quite over 100) retries a second.
For transient errors, where retries are applied, the above is true. For terminal errors there is no back-off and no operation retry; the flow instead looks like this:
In its trade-offs, the processor favors throughput. As a result, each reader task is managed independently, and restarts after a fault happen immediately. Its goal is to resume work as quickly as possible and keep messages flowing.
So I do believe that @JoshLove-msft probably had a good point in there: this algorithm does not apply for the non-transient error which seems to be my case. Would it be possible to do something like this perhaps?
- Add some delay between failed operations (regardless of whether the error was transient or not), preferably configurable
- If the listener cannot connect, implement something like a circuit-breaker pattern so that not all would be pounding at once and hitting the same error
- Maybe apply retry policy for the non-transient errors as well, since it is hard to distinguish. E.g. a user access error can disappear when a correct access is given, so it is not quite clear cut if it is transient or not.
These are something that we'd need to think through. The processor should continue to favor throughput and assume the majority case - which is that errors resolve quickly. Something opt-in and configuration-driven is worth considering. I'm going to move this over to Josh. @JoshLove-msft: would you mind taking a look at these suggestions?
Yep, I will take a closer look at this. I agree it is tricky to do without introducing new config, but at the same time this seems like something that people will not configure until they get bitten by it.
I'm going to shift this to a feature request and put it on the backlog for consideration.
Library name and version
Azure.Messaging.ServiceBus 7.10.0
Describe the bug
I have created an instance of the ServiceBusProcessor via a client that uses DefaultAzureCredential. In my environment variable setup I had configured a wrong ClientId (one that does not exist in Azure AD). In this case the processor appears to reconnect in an infinite loop, invoking the OnError callback each time. I have logging on that event, and it generated enormous amounts of logs (probably also DDoSing Azure AD).
Expected behavior
There should be some default back-off strategy (e.g. exponential) for connection attempts. In addition to the default strategy, there should ideally be a configuration setting to supply your own.
Actual behavior
It seems it is an infinite loop for reconnect without any timeout.
Reproduction Steps
In my case I had the service principal credentials supplied from environment variables, and the client id was wrong (just some string like my-test-app).
Environment
dotnet 6 running on ubuntu linux in aks