Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl
@tollstorff
Thanks for bringing this to our attention again! My initial guess was that it might be caused by a race condition on the backend, but I think you are right; there may be client-side caching causing this. I'm afraid this will need more investigation on the client side.
Is there a specific use-case you are looking to support?
The default retry policy is built into the client and will essentially automatically reconnect with the new namespace. Is that failing for you as well?
The default retry policy is built into the client and will essentially automatically reconnect with the new namespace. Is that failing for you as well?
This is what seems to be broken. I have never seen the provided script recover.
Is there a specific use-case you are looking to support?
I just want this automatic reconnect you mention to work. It only seems to work if there's a few minutes without any interaction with the management client. With frequent interaction, like in the provided script, it never seems to recover.
Are you able to reproduce this? Two fresh service buses and the provided script should be enough.
EDIT: @axisc
@tollstorff, we are not able to reproduce your issue. In order to analyze the problem, would you please share your subscription ID, namespace, and the correlation ID of the error?
@tollstorff, have you been able to gather the requested data? You are welcome to create a support request to analyze in detail what caused the failures to reconnect. Please share any support request ID here.
@tollstorff, please do not send your subscription ID on this channel (it is fine to include when creating a support request). The other pieces of data are OK to share on this forum, but the best support channel for this kind of issue would be to create a support request.
@jfggdl:
the best support channel for this kind of issue would be to create a support request
I have already explained in detail that it seems the issue is client side, which is why I have re-opened this issue. The main problem seems to be:
it seems that if an error occurs within a single instance of a management client, it will somehow cache this error state, and the program needs to wait at least 2 minutes for it to clear the cache.
As a response to my description and observations @axisc said:
I think you are right; there may be client-side caching causing this. I'm afraid this will need more investigation on the client side.
That said, I understand that if you cannot reproduce this issue, it will need to be closed. I will get back to you this week either with more details or to close this issue.
@tollstorff, I will not close this issue. We will wait for any additional details/findings that you have.
@jfggdl @axisc
I think we need you to be able to reproduce the issue. I have provided a more detailed list of reproduction steps below, and have verified that I still see the problem today using these steps. Can you try to follow them?
I will list the settings I used; leave the rest at their default values and choose your own names.
1. Create a new resource group
   - Region: (Europe) West Europe
2. Add the primary Service Bus
   - Pricing tier: Premium
   - Resource group: The one created in step 1.
   - Location: West Europe
3. Add the secondary Service Bus
   - Pricing tier: Premium
   - Resource group: The one created in step 1.
   - Location: West US
4. Pair the two namespaces by creating a Geo-Recovery alias on the primary namespace, with the secondary namespace as the failover target.
5. Create a new .NET Core console app project in Visual Studio 2019
6. Install the Microsoft.Azure.ServiceBus NuGet package v. 4.0.0
7. Replace the content of Program.cs with the following:
using Microsoft.Azure.ServiceBus.Management;
using System;
using System.Diagnostics;
using System.Threading.Tasks;

namespace FailoverIssue
{
    class Program
    {
        private const string AliasPrimaryConnectionString = "Use your own";

        static async Task Main(string[] args)
        {
            var client = new ManagementClient(AliasPrimaryConnectionString);
            var topic = await client.CreateTopicAsync(new TopicDescription("topic"));
            Stopwatch timeSinceFirstError = new Stopwatch();

            while (true)
            {
                try
                {
                    // Poll the topic through the Geo-Recovery alias every 2 seconds.
                    await client.TopicExistsAsync(topic.Path);
                    Console.WriteLine("Topic exists");
                }
                catch (Exception ex)
                {
                    // Track how long the client has been failing since the first error.
                    timeSinceFirstError.Start();
                    Console.WriteLine($"{timeSinceFirstError.Elapsed.Minutes} m {timeSinceFirstError.Elapsed.Seconds} s since first error:");
                    Console.WriteLine(ex.Message);
                }

                await Task.Delay((int)TimeSpan.FromSeconds(2).TotalMilliseconds);
            }
        }
    }
}
Add the primary connection string for the alias (see Azure portal setup above) to the const field.
Once the pairing is complete (pairing status: Paired) and you have the console app set up, you are ready to test:
NOTE: As previously mentioned, I have observed that increasing the Task.Delay to 2 minutes will make it recover immediately. This is why I think there's a caching issue.
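For reference, a minimal sketch of the variant I mean by that note: only the delay at the bottom of the loop changes.

// Variant of the repro loop above: with a 2-minute gap between calls,
// I have observed the client recovering after failover.
await Task.Delay(TimeSpan.FromMinutes(2));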
@tollstorff, thanks for the detailed repro steps. We will try to reproduce the problem. Do you have some cache (proxies, etc.) on the path to the service?
@jfggdl Thanks. No, not that I know of.
@tollstorff, thanks for the detailed steps.
It does look like a caching issue happening at the HttpClient layer.
I used Wireshark and realized that even though the failover did trigger, the HttpClient continued to talk to the old IP address instead of the new one. However, DNS resolution of the alias itself was pointing to the right place.
Essentially, the .NET library caches the DNS resolution, which still points to the older namespace (the primary namespace before failover) and never gets invalidated.
It looks like this is a known issue, and there is some discussion around it here: https://github.com/dotnet/corefx/issues/11224#issuecomment-415845645 https://medium.com/@tsuyoshiushio/cleaning-dns-cache-with-azure-search-client-14cbd73d8195
But here's the bad news: the fix proposed there only exists on .NET Core, not on .NET Standard. The SocketsHttpHandler class isn't present in .NET Standard.
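For completeness, a rough sketch of what that .NET Core-only mitigation looks like in general HttpClient code. This is not something the ManagementClient exposes today; the class name below is just illustrative.

using System;
using System.Net.Http;

static class DnsAwareHttp
{
    // Sketch only: on .NET Core 2.1+, SocketsHttpHandler can cap the pooled
    // connection lifetime, so connections (and the DNS resolution behind them)
    // are periodically dropped and re-established.
    public static readonly HttpClient Client = new HttpClient(
        new SocketsHttpHandler
        {
            PooledConnectionLifetime = TimeSpan.FromMinutes(1)
        });
}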
The alternative solution I can think of for now is to recycle the ManagementClient every minute, essentially forcing it to close the connection and recreate it (which is roughly what SocketsHttpHandler would also do).
Is this a viable approach for you to take?
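A rough sketch of what I mean on the calling side. The names here are illustrative, and it assumes ManagementClient.CloseAsync() is used to tear down the old instance; it is not thread-safe, it just shows the recycling idea.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus.Management;

// Sketch of the proposed workaround: throw away the ManagementClient
// periodically so its underlying HttpClient (and cached DNS) is recreated.
class RecyclingManagementClient
{
    private readonly string _connectionString;
    private ManagementClient _client;
    private DateTime _createdAt;

    public RecyclingManagementClient(string connectionString)
    {
        _connectionString = connectionString;
        _client = new ManagementClient(connectionString);
        _createdAt = DateTime.UtcNow;
    }

    public async Task<ManagementClient> GetClientAsync()
    {
        if (DateTime.UtcNow - _createdAt > TimeSpan.FromMinutes(1))
        {
            // Drop the stale connection and start fresh.
            await _client.CloseAsync();
            _client = new ManagementClient(_connectionString);
            _createdAt = DateTime.UtcNow;
        }
        return _client;
    }
}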
Thanks @nemakam, great that you found the underlying issue.
The alternative solution I can think of for now is to recycle the ManagementClient every minute, essentially forcing it to close the connection and recreate it (which is roughly what SocketsHttpHandler would also do). Is this a viable approach for you to take?
I can definitely write some workaround for this issue on my side. In addition to your suggested solution, I could probably also recycle it when there's an error, and otherwise keep it.
Still, from the point of view of a regular Azure SDK user, who has not seen this thread, this seems like unexpected behavior. It would be nice if this could be handled within the SDK. One could argue that the whole point of the failover mechanism is to seamlessly switch to the secondary namespace.
@axisc previously mentioned that:
The default retry policy is built into the client and will essentially automatically reconnect with the new namespace. Is that failing for you as well?
which would indicate that an automatic reconnect to the new namespace is supposed to be handled by the SDK. To me it seems like an edge case bug in that functionality.
@tollstorff , I agree. We can add recycling logic into our SDK. Maybe have a param to control that. I'll leave this issue open to take that up as a bug fix. Feel free to raise a PR.
We have identified some potential solutions and we will be studying the feasibility of each. Stay tuned.
@tollstorff We have checked in some changes to our backend service for this issue, and the changes are already in production now. You do not need to update your NuGet package, but it doesn't hurt to get the latest version. I ran your repro example again and am not seeing the failure now. Can you please try it and see if the problem has been fixed for you? Thanks so much, and sorry about the delay.
@tollstorff - checking in to see if this is still an open issue for you. If not, please let us know so that we can close it.
Hi @DorothySun216 and @axisc, I appreciate you getting back to me on this.
I am currently focusing my attention elsewhere, and will not be able to verify if the changes have resolved this issue. It sounds promising that the reproduction steps did not cause any issues for @DorothySun216 after these changes.
Because of the above, I think this issue has reached its conclusion for now and you are free to close it.
I created an issue 19 days ago, which I believe was prematurely closed. I haven't received a response to my latest comment for 13 days, so I will create a new one here.
Problem
An instance of Microsoft.Azure.ServiceBus.Management.ManagementClient pointing to an alias for a Premium Service Bus Geo-Recovery setup can enter a permanent error state during failover.
Please read the original issue and comments for a more elaborate description.
Steps to reproduce
Why was it closed?
I was asked to move it to a support ticket:
Why reopen?
Please read this comment from the original issue:
I can run another script (see "FailsButRecovers" here), which can successfully recover by waiting a few minutes. If I run these two programs at the same time against the same alias, their individual behavior is intact: one will permanently fail, while the other will recover after waiting a few minutes (I set the waiting time to 4 minutes in this test). This makes it seem like a client-side issue, and since it is simple to reproduce using two fresh service buses, I don't think it is efficient to move this over to support.
Specifically, it seems that if an error occurs within a single instance of a management client, it will somehow cache this error state, and the program needs to wait at least 2 minutes for it to clear the cache.
FYI @axisc