Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.

[BUG] ServiceBus.ManagementClient enters permanent error state during failover #7454

Closed: tollstorff closed this issue 4 years ago

tollstorff commented 5 years ago

I created an issue 19 days ago, which I believe was prematurely closed. I haven't received a response to my latest comment for 13 days, so I will create a new one here.

Problem: An instance of Microsoft.Azure.ServiceBus.Management.ManagementClient pointing to an alias for a Premium Service Bus Geo-Recovery setup can enter a permanent error state during failover.

Please read the original issue and comments for a more elaborate description.

Steps to reproduce

  1. Pair two service buses (Premium Service Bus Geo-Recovery)
  2. Start a console app that calls this method against the alias
public static async Task FailsAndNeverRecovers()
{
    // AliasConnectionString is the primary connection string of the Geo-Recovery alias
    var client = new ManagementClient(AliasConnectionString);
    var topic = await client.CreateTopicAsync(new TopicDescription("topic"));
    while (true)
    {
        try
        {
            await client.TopicExistsAsync(topic.Path);
            Console.WriteLine("Topic exists.");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        await Task.Delay((int)TimeSpan.FromSeconds(2).TotalMilliseconds);
    }
}
  3. Manually trigger failover in the Azure portal UI
  4. Observe that the method above starts failing and never recovers

Why was it closed? I was asked to move it to a support ticket:

We will have to investigate in our logs. Could you open a support ticket on Azure support and share the appropriate details about the namespace? I'll close the issue and we can track this via support.

Why reopen? Please read this comment from the original issue:

I can run another script (See "FailsButRecovers" here), which can successfully recover by waiting a few minutes. If I run these two programs at the same time against the same alias, their individual behavior is intact - one will permanently fail, while the other will recover after waiting a few minutes (I set the waiting time to 4 minutes in this test). This makes it seem like a client-side issue, and since it is simple to reproduce using two fresh service buses, I don't think it is efficient to move this over to support.

Specifically, it seems that if an error occurs within a single instance of a management client, it will somehow cache this error state, and the program needs to wait at least 2 minutes for it to clear the cache.
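
For reference, the FailsButRecovers variant is essentially the same loop with a much longer delay between calls. The full script is only linked in the original issue, so the following is just a sketch of it, reusing the AliasConnectionString placeholder from above:

// Sketch of the "FailsButRecovers" variant: the same polling loop, but with the
// delay between calls raised to 4 minutes. With that spacing the client recovers
// after failover instead of failing permanently.
public static async Task FailsButRecovers()
{
    var client = new ManagementClient(AliasConnectionString);
    while (true)
    {
        try
        {
            await client.TopicExistsAsync("topic");
            Console.WriteLine("Topic exists.");
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        await Task.Delay(TimeSpan.FromMinutes(4));
    }
}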

FYI @axisc

ghost commented 5 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl

axisc commented 5 years ago

@tollstorff

Thanks for bringing this to our attention again! My initial guess was that it might be caused by a race condition on the backend, but I think you are right; there may be client-side caching causing this. I'm afraid this will need more investigation on the client side.

Is there a specific use-case you are looking to support?

The default retry policy is built into the client and will essentially automatically reconnect with the new namespace. Is that failing for you as well?

tollstorff commented 5 years ago

The default retry policy is built into the client and will essentially automatically reconnect with the new namespace. Is that failing for you as well?

This is what seems to be broken. I have never seen the provided script recover.

Is there a specific use-case you are looking to support?

I just want this automatic reconnect you mention to work. It only seems to work if there are a few minutes without any interaction with the management client. With frequent interaction, like in the provided script, it never seems to recover.

Are you able to reproduce this? Two fresh service buses and the provided script should be enough.

EDIT: @axisc

jfggdl commented 5 years ago

@tollstorff, we are not able to reproduce your issue. In order to be able to analyze the problem, would you please share your subscription id, namespace, and the correlation id of the error?

jfggdl commented 5 years ago

@tollstorff, have you gathered the requested data? You are welcome to create a support request so we can analyze in detail what caused the failures to reconnect. Please share any support request id here.

jfggdl commented 5 years ago

@tollstorff, please do not send your subscription id on this channel (it is fine to include it when creating a support request). The other pieces of data are OK to share on this forum, but the best support channel for this kind of issue would be to create a support request.

tollstorff commented 5 years ago

@jfggdl:

the best support channel for this kind of issue would be to create a support request

I have already explained in detail that it seems the issue is client side, which is why I have re-opened this issue. The main problem seems to be:

it seems that if an error occurs within a single instance of a management client, it will somehow cache this error state, and the program needs to wait at least 2 minutes for it to clear the cache.

As a response to my description and observations @axisc said:

I think you are right, there may be client side caching that is causing this. I'm afraid this will need more investigation on the client side.

That said, I understand that if you cannot reproduce this issue, it will need to be closed. I will get back to you this week either with more details or to close this issue.

jfggdl commented 5 years ago

@tollstorff, I will not close this issue. We will wait for any additional details/findings that you have.

tollstorff commented 5 years ago

@jfggdl @axisc

I think we need you to be able to reproduce the issue. I have provided a more detailed list of reproduction steps below and have verified that I still see the problem today using these steps. Can you try to follow them?

Azure portal setup

I will list the settings I used. Leave the rest as their default values and choose your own names.

  1. Create a new resource group

    Region: (Europe) West Europe
  2. Add the primary service bus

    Pricing tier: Premium
    
    Resource group: The one created in step 1.
    
    Location: West Europe
  3. Add the secondary service bus

    Pricing tier: Premium
    
    Resource group: The one created in step 1.
    
    Location: West US
  4. Pair the two service buses
    1. Click on Geo-Recovery for the primary service bus (from step 2) and click Initiate pairing
    2. Choose Use existing and select the secondary service bus (from step 3)
  5. Get a connection string for the alias
    1. Click on Geo-Recovery for either service bus to enter the new alias page
    2. Click on Shared access policies
    3. Click on RootManageSharedAccessKey
    4. Save the Alias Primary Connection String for later

Local setup

  1. Create a new .NET Core console app project in Visual Studio 2019

  2. Install the Microsoft.Azure.ServiceBus NuGet package v. 4.0.0

  3. Replace the content of Program.cs with the following:

    using Microsoft.Azure.ServiceBus.Management;
    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;
    
    namespace FailoverIssue
    {
        class Program
        {
            private const string AliasPrimaryConnectionString = "Use your own";
    
            static async Task Main(string[] args)
            {
                var client = new ManagementClient(AliasPrimaryConnectionString);
                var topic = await client.CreateTopicAsync(new TopicDescription("topic"));
    
                Stopwatch timeSinceFirstError = new Stopwatch();
                while (true)
                {
                    try
                    {
                        await client.TopicExistsAsync(topic.Path);
                        Console.WriteLine("Topic exists");
                    }
                    catch (Exception ex)
                    {
                        timeSinceFirstError.Start();
                        Console.WriteLine($"{timeSinceFirstError.Elapsed.Minutes} m {timeSinceFirstError.Elapsed.Seconds} s since first error:");
                        Console.WriteLine(ex.Message);
                    }
                    await Task.Delay((int)TimeSpan.FromSeconds(2).TotalMilliseconds);
                }
            }
        }
    }
    
  4. Add the primary connection string for the alias (see Azure portal setup above) to the const field.

Test

Once the pairing is complete (pairing status: Paired) and you have the console app set up, you are ready to test:

  1. Run the console app
  2. Verify that it continuously writes "Topic exists"
  3. Go to the Azure portal, then the alias page, and trigger the failover
  4. Verify that the console app starts failing within 1-2 minutes (the error says the namespace does not exist)
  5. Once it has failed for the first time, it will never recover. Wait 20 minutes (I added a stopwatch in the code to help with this).
  6. After 20 minutes, verify that the console app is still failing. This is the problem. It doesn't recover.

NOTE: As previously mentioned, I have observed that increasing the Task.Delay to 2 minutes will make it recover immediately. This is why I think there's a caching issue.

jfggdl commented 5 years ago

@tollstorff, thanks for the detailed repro steps. We will try to reproduce the problem. Do you have any caches (proxies, etc.) on the path to the service?

tollstorff commented 5 years ago

@jfggdl Thanks. No, not that I know of.

nemakam commented 5 years ago

@tollstorff, thanks for the detailed steps. It does look like a caching issue at the HttpClient layer. I used Wireshark and realized that even though the failover did trigger, the HttpClient continued to talk to the old IP address rather than the new one. However, DNS resolution of the alias itself was pointing to the right place. Essentially, the .NET library caches the DNS resolution, which still points to the older namespace (the primary namespace before failover) and never gets invalidated.
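
If anyone wants to double-check the DNS side themselves, a plain lookup against the alias shows where it currently resolves (the hostname below is a placeholder; use your own alias):

using System;
using System.Net;
using System.Threading.Tasks;

class DnsCheck
{
    // Resolve the alias hostname directly; after failover this returns the
    // secondary namespace's addresses, even while the cached HttpClient
    // connection keeps talking to the old one.
    static async Task Main()
    {
        foreach (var address in await Dns.GetHostAddressesAsync("myalias.servicebus.windows.net"))
        {
            Console.WriteLine(address);
        }
    }
}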

It looks like it's a known issue; there is some discussion around it here: https://github.com/dotnet/corefx/issues/11224#issuecomment-415845645 and https://medium.com/@tsuyoshiushio/cleaning-dns-cache-with-azure-search-client-14cbd73d8195

But here's the bad news: the fix they have proposed only exists on .NET Core and not on .NET Standard. The SocketsHttpHandler class isn't present in .NET Standard.
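
For context, the fix discussed in those links looks roughly like this on .NET Core 2.1+ (only to illustrate the mechanism; the class name here is illustrative, and as far as I can tell ManagementClient builds its HttpClient internally and does not let you supply a handler today):

using System;
using System.Net.Http;

static class ConnectionRecycling
{
    // SocketsHttpHandler can cap how long a pooled connection lives. Once the
    // lifetime expires, the next request opens a new connection and performs a
    // fresh DNS lookup, so a failover is eventually picked up.
    public static HttpClient Create() =>
        new HttpClient(new SocketsHttpHandler
        {
            PooledConnectionLifetime = TimeSpan.FromMinutes(2)
        });
}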

The alternative solution I can think of for now is to recycle the ManagementClient every minute, essentially forcing it to close and recreate the connection (which is roughly what SocketsHttpHandler would also do). Is this a viable approach for you to take?
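
Something along these lines (a rough sketch only; it assumes ManagementClient.CloseAsync is available to dispose the underlying HttpClient, and reuses the connection string placeholder from your repro):

using Microsoft.Azure.ServiceBus.Management;
using System;
using System.Threading.Tasks;

class RecyclingLoop
{
    private const string AliasPrimaryConnectionString = "Use your own";

    static async Task Main()
    {
        var client = new ManagementClient(AliasPrimaryConnectionString);
        var lastRecycle = DateTime.UtcNow;

        while (true)
        {
            // Recycle the client every minute so a fresh HttpClient (and hence a
            // fresh connection and DNS lookup) is used after a failover.
            if (DateTime.UtcNow - lastRecycle > TimeSpan.FromMinutes(1))
            {
                await client.CloseAsync();
                client = new ManagementClient(AliasPrimaryConnectionString);
                lastRecycle = DateTime.UtcNow;
            }

            try
            {
                await client.TopicExistsAsync("topic");
                Console.WriteLine("Topic exists");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }

            await Task.Delay(TimeSpan.FromSeconds(2));
        }
    }
}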

tollstorff commented 5 years ago

Thanks @nemakam, great that you found the underlying issue.

The other alternative solutions for now that I can think of is that we recycle the ManagementClient every minute, essentially forcing it to close the connection and recreate the connection (which is kind of what SocketsHttpHandler would also do). Is this a viable approach for you to take?

I can definitely write a workaround for this issue on my side. In addition to your suggested solution, I could probably also recycle the client when there's an error and otherwise keep it.
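
Roughly like this, for the recycle-on-error idea (a sketch reusing the names from my repro program; as above, it assumes CloseAsync can be used to drop the stale connection):

using Microsoft.Azure.ServiceBus.Management;
using System;
using System.Threading.Tasks;

class RecycleOnError
{
    private const string AliasPrimaryConnectionString = "Use your own";
    private static ManagementClient _client = new ManagementClient(AliasPrimaryConnectionString);

    // Keep the client until a call fails; on failure, replace it so the next
    // attempt opens a new connection and sees the post-failover DNS result.
    static async Task<bool> TopicExistsRecyclingOnError(string topicPath)
    {
        try
        {
            return await _client.TopicExistsAsync(topicPath);
        }
        catch (Exception)
        {
            await _client.CloseAsync();
            _client = new ManagementClient(AliasPrimaryConnectionString);
            throw; // let the caller decide whether to retry immediately
        }
    }
}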

Still, from the point of view of a regular Azure SDK user, who has not seen this thread, this seems like unexpected behavior. It would be nice if this could be handled within the SDK. One could argue that the whole point of the failover mechanism is to seamlessly switch to the secondary namespace.

@axisc previously mentioned that:

The default retry policy is built into the client and will essentially automatically reconnect with the new namespace. Is that failing for you as well?

which would indicate that an automatic reconnect to the new namespace is supposed to be handled by the SDK. To me it seems like an edge case bug in that functionality.

nemakam commented 5 years ago

@tollstorff, I agree. We can add recycling logic to our SDK, maybe with a parameter to control it. I'll leave this issue open to take that up as a bug fix. Feel free to raise a PR.

jfggdl commented 5 years ago

We have identified some potential solutions and we will be studying the feasibility of each. Stay tuned.

DorothySun216 commented 4 years ago

@tollstorff We have checked in some changes to our backend service for this issue, and the changes are already in production. You do not need to update your NuGet package, but it doesn't hurt to get the latest version. I ran your repro example again and am not seeing the failure now. Can you please check whether the problem has been fixed for you? Thanks so much, and sorry about the delay.

axisc commented 4 years ago

@tollstorff - checking in to see if this is still an open issue for you. If not, please let us know so that we can close it.

tollstorff commented 4 years ago

Hi @DorothySun216 and @axisc, I appreciate you getting back to me on this.

I am currently focusing my attention elsewhere, and will not be able to verify if the changes have resolved this issue. It sounds promising that the reproduction steps did not cause any issues for @DorothySun216 after these changes.

Because of the above, I think this issue has reached its conclusion for now and you are free to close it.