Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.

[BUG] `ServiceBusProcessor` tries to reconnect without any delay if there is an issue #31648

Open ilya-scale opened 1 year ago

ilya-scale commented 1 year ago

Library name and version

Azure.Messaging.ServiceBus 7.10.0

Describe the bug

I created an instance of ServiceBusProcessor via a client that uses DefaultAzureCredential. In my environment variables I had configured a wrong ClientId (one that does not exist in Azure AD).

It seems that in this case the processor retries the connection in an infinite loop, invoking the OnError callback each time. I have logging on that event, and it generated an enormous amount of logs (and was probably DDoSing Azure AD).

Expected behavior

There should be some default backoff strategy (e.g. exponential) for connection attempts. In addition to the default backoff strategy, there should ideally be a configuration setting for supplying your own.

Actual behavior

It appears to be an infinite reconnect loop without any delay.

Reproduction Steps

```csharp
var client = new ServiceBusClient(settings.FullyQualifiedNamespace, new DefaultAzureCredential());

_processor = client.CreateProcessor(
    settings.TopicName,
    settings.SubscriptionName,
    new ServiceBusProcessorOptions
    {
        // If we use the default of 1, every error completely blocks the processor
        // for the duration of the retry timeout.
        MaxConcurrentCalls = 5
    });

_processor.ProcessMessageAsync += OnMessage;
_processor.ProcessErrorAsync += OnError;
await _processor.StartProcessingAsync(cancellationToken);
```

In my case the service principal credentials were supplied from environment variables, and the client id was wrong (just some string like my-test-app).

Environment

.NET 6 running on Ubuntu Linux in AKS

jsquire commented 1 year ago

Hi @ilya-scale. The processor follows the configured retry policy for all operations, including attempts to connect. The retry policy can be configured via the ServiceBusClientOptions used to create your client, as demonstrated here.
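
For example, a minimal sketch of passing retry options when creating the client (the specific values below are placeholders, not recommendations):

```csharp
var clientOptions = new ServiceBusClientOptions
{
    RetryOptions = new ServiceBusRetryOptions
    {
        // Placeholder values; tune these for your workload.
        Mode = ServiceBusRetryMode.Exponential,
        MaxRetries = 5,
        Delay = TimeSpan.FromSeconds(1),
        MaxDelay = TimeSpan.FromSeconds(30),
        TryTimeout = TimeSpan.FromSeconds(60)
    }
};

var client = new ServiceBusClient(
    settings.FullyQualifiedNamespace,
    new DefaultAzureCredential(),
    clientOptions);
```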

The processor is highly resilient and will attempt to recover from exceptions indefinitely. Since the processor lacks insight into your application and host environment, each exception is considered individually. Since you're configured for 5 concurrent calls, it's likely those retries will overlap and show a consistent pattern of activity. This is by design: we intentionally favor recovering rapidly and maximizing throughput.

I'm going to mark this as addressed, but feel free to unresolve if you'd like to continue the discussion.

ghost commented 1 year ago

Hi @ilya-scale. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

JoshLove-msft commented 1 year ago

I think part of the issue here is that when an error is not considered retriable, there is no delay between failures. Even for retriable errors, there would be no delay between the final attempt and the first attempt of the next cycle.

As a workaround, you can add a delay in your error handler for UnauthorizedAccessExceptions. Because the processor will await the error handler, this will prevent the scenario where the processor does not wait between attempts.

```csharp
internal static async Task OnError(ProcessErrorEventArgs args)
{
    if (args.Exception is UnauthorizedAccessException)
    {
        await Task.Delay(5000);
    }
}
```
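
If the delay should also end promptly when the processor is being stopped, a variant of the same workaround (keeping the 5-second delay above) could observe args.CancellationToken, which is tied to the processor's lifetime:

```csharp
internal static async Task OnError(ProcessErrorEventArgs args)
{
    if (args.Exception is UnauthorizedAccessException)
    {
        try
        {
            // Observe the processor's token so that stopping the processor
            // is not held up by the remainder of the delay.
            await Task.Delay(TimeSpan.FromSeconds(5), args.CancellationToken);
        }
        catch (TaskCanceledException)
        {
            // The processor is stopping; no need to keep delaying.
        }
    }
}
```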

I do think it is worth considering adding some handling in the library itself for this as I have heard of other users hitting this issue when they have misconfigured something.

ilya-scale commented 1 year ago

/unresolve

ilya-scale commented 1 year ago

@jsquire I believe there is something not quite correct in your statement. The link you provided shows that ServiceBusClient can be configured with retry options, and it also says there is a default (which seems correct, as I have checked the library code): by default it is set up with a 1 minute timeout and a maximum of 3 retries.

  1. You say there will be infinite retries, while the default says 3 retries with a timeout of up to 1 minute
  2. Nor does that default seem to match what I see, since there are infinite retries without any delay

So I believe this timeout has no effect on the connection attempts, or maybe I misunderstand something here?

P.S. I do see over 100 retry attempts every second without any delay
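
For reference, this is how I read the defaults from the library code, so treat the exact values as my reading rather than official documentation:

```csharp
// Equivalent of the default retry options as I understand them.
var retryDefaults = new ServiceBusRetryOptions
{
    Mode = ServiceBusRetryMode.Exponential,
    MaxRetries = 3,
    Delay = TimeSpan.FromSeconds(0.8),
    MaxDelay = TimeSpan.FromMinutes(1),
    TryTimeout = TimeSpan.FromMinutes(1)
};
```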

ilya-scale commented 1 year ago

@JoshLove-msft What you say makes sense. I am not so worried about this particular exception (the UnauthorizedAccess one), as it generally happens during the initial setup and should not persist. I was more worried about what happens if there is some Service Bus downtime or a network issue: will there be infinite retries without delay then as well?

Another question: does it make sense not to retry for all of the listeners at the same time, since they will all hit the same error? I.e. should all of the listener tasks except one stop retrying until that one listener connects?

jsquire commented 1 year ago

@jsquire I believe there is something not quite correct in your statement. The link you provided shows that ServiceBusClient can be configured with retry options, and it also says there is a default (which seems correct, as I have checked the library code): by default it is set up with a 1 minute timeout and a maximum of 3 retries.

The retry policy applies to an individual operation. For example: "Let's connect to Service Bus". That one operation is allowed 3 retries in the case of transient failures, with each try taking up to 60 seconds and then an exponential back-off between them.

However, the processor itself is built to be resilient. If an attempt at an operation fails, the processor will keep attempting to make forward progress. The task for that failed "Let's read a message" is faulted and the processor will discard it and start another. This is where you'll see the potential for overlap between concurrent attempts and, as Josh pointed out (and I forgot to mention), where you'll see a non-transient error cycle much more quickly.

I think the disconnect in our conversation is the difference between "one operation is retried and then fails" versus "the processor will continue to try and make forward progress by continuing to spawn new operations when they fail until it is stopped."

jsquire commented 1 year ago

Another question: does it make sense not to retry for all of the listeners at the same time, since they will all hit the same error?

That, unfortunately, is an incredibly difficult thing to judge in a generalized manner. A terminal error may not be a permanent one. Since the processor lacks insight into your application and host environment, it relies on your application to react to error patterns within the error handler and make the decisions appropriate for it. For example, stopping processing if errors are occurring at a rapid pace and no messages are being processed.
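
To make that concrete, here is one minimal sketch of the kind of application-level reaction I mean. The 30-second window, the threshold of 25 errors, and the host-side stop are all illustrative choices, not guidance built into the library:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Tracks recent errors and signals the host when they occur too quickly.
internal sealed class ErrorRateMonitor
{
    private readonly Queue<DateTimeOffset> _recentErrors = new();
    private readonly object _gate = new();

    // Cancelled when the error rate crosses the threshold; the host observes
    // this and calls StopProcessingAsync outside of the handler.
    public CancellationTokenSource TooManyErrors { get; } = new();

    public Task OnError(ProcessErrorEventArgs args)
    {
        bool thresholdCrossed;

        lock (_gate)
        {
            var now = DateTimeOffset.UtcNow;
            _recentErrors.Enqueue(now);

            // Drop errors older than the 30-second window.
            while (_recentErrors.Count > 0 && now - _recentErrors.Peek() > TimeSpan.FromSeconds(30))
            {
                _recentErrors.Dequeue();
            }

            thresholdCrossed = _recentErrors.Count >= 25;
        }

        if (thresholdCrossed)
        {
            TooManyErrors.Cancel();
        }

        return Task.CompletedTask;
    }
}
```

The host would register monitor.OnError as the ProcessErrorAsync handler and, when TooManyErrors fires, await StopProcessingAsync from outside the handler rather than stopping from within it.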

ilya-scale commented 1 year ago

Yes, it seems I misunderstood it somewhat. So, if I follow what you wrote, what happens now is this:

  1. One listener tries to connect and fails
  2. It waits 0.8 seconds (the default retry delay)
  3. It does that 3 times, and then the process starts again at step 1 without any delay

If my understanding is correct, that gives us under 10 retries per second for 5 listeners, while in reality this is not the case and I see around 100 (actually well over 100) retries a second.

So I believe @JoshLove-msft probably had a good point there: this algorithm does not apply to non-transient errors, which seems to be my case. Would it be possible to do something like the following?

  1. Add some delay between failed operations (regardless of whether the error was transient or not), preferably configurable
  2. If a listener cannot connect, implement something like a circuit-breaker pattern so that they are not all pounding away at once and hitting the same error
  3. Maybe apply the retry policy to non-transient errors as well, since they are hard to distinguish. E.g. a user access error can disappear once correct access is granted, so it is not clear cut whether it is transient or not.

jsquire commented 1 year ago

Yes, it seems I misunderstood it somewhat. So, if I follow what you wrote, what happens now is this:

  1. One listener tries to connect and fails
  2. It waits 0.8 seconds (the default retry delay)
  3. It does that 3 times, and then the process starts again at step 1 without any delay

That's generally correct for a transient error, at a high level. (Timing on the back-off isn't linear and there's some overhead in managing the tasks). This happens per degree of concurrency, since each is an independent unit.
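
To put rough numbers on that, a plain doubling back-off starting from the defaults is only an approximation (the library's actual calculation also applies jitter), but it gives a sense of scale:

```csharp
// Approximation only; not the library's exact calculation.
static TimeSpan ApproximateBackoff(int attempt, ServiceBusRetryOptions retry) =>
    TimeSpan.FromMilliseconds(
        Math.Min(
            retry.Delay.TotalMilliseconds * Math.Pow(2, attempt - 1),
            retry.MaxDelay.TotalMilliseconds));

// With the defaults (0.8s delay, 60s max, 3 retries), a single failed operation
// waits very roughly 0.8s, 1.6s, and 3.2s between its tries.
```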

If my understanding is correct, that gives us under 10 retries per second for 5 listeners, while in reality this is not the case and I see around 100 (actually well over 100) retries a second.

For transient errors, where retries are applied, the above is true. For terminal errors there is no back-off and no operation-level retry. That flow looks like:

  1. The processor spawns a task to read messages (for each degree of concurrency)
  2. The task tries to connect and fails with a terminal exception. It is not retried
  3. The task faults
  4. The next iteration of the processor's management loop kills all faulted tasks
  5. The process starts again at 1 without any delay

The processor favors throughput in its trade-offs. As a result, each reader task is managed independently, and restarts after faults are performed immediately. Its goal is to resume work as quickly as possible and keep messages flowing.

jsquire commented 1 year ago

So I believe @JoshLove-msft probably had a good point there: this algorithm does not apply to non-transient errors, which seems to be my case. Would it be possible to do something like the following?

  1. Add some delay between failed operations (regardless of whether the error was transient or not), preferably configurable
  2. If a listener cannot connect, implement something like a circuit-breaker pattern so that they are not all pounding away at once and hitting the same error
  3. Maybe apply the retry policy to non-transient errors as well, since they are hard to distinguish. E.g. a user access error can disappear once correct access is granted, so it is not clear cut whether it is transient or not.

These are things we'd need to think through. The processor should continue to favor throughput and assume the majority case, which is that errors resolve quickly. Something opt-in and configuration-driven is worth considering. I'm going to move this over to Josh. @JoshLove-msft: would you mind taking a look at these suggestions?

JoshLove-msft commented 1 year ago

So I believe @JoshLove-msft probably had a good point there: this algorithm does not apply to non-transient errors, which seems to be my case. Would it be possible to do something like the following?

  1. Add some delay between failed operations (regardless of whether the error was transient or not), preferably configurable
  2. If a listener cannot connect, implement something like a circuit-breaker pattern so that they are not all pounding away at once and hitting the same error
  3. Maybe apply the retry policy to non-transient errors as well, since they are hard to distinguish. E.g. a user access error can disappear once correct access is granted, so it is not clear cut whether it is transient or not.

These are things we'd need to think through. The processor should continue to favor throughput and assume the majority case, which is that errors resolve quickly. Something opt-in and configuration-driven is worth considering. I'm going to move this over to Josh. @JoshLove-msft: would you mind taking a look at these suggestions?

Yep, I will take a closer look at this. I agree it is tricky to do without introducing new config, but at the same time this seems like something that people will not configure until they get bitten by it.

JoshLove-msft commented 1 year ago

Related - https://github.com/Azure/azure-sdk-for-net/issues/29290

jsquire commented 1 year ago

I'm going to shift this to a feature request and put it on the backlog for consideration.