Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.

[BUG] Requests to Key Vaults are timing out after considerable time #37420

Closed: pascalu closed this 7 months ago

pascalu commented 1 year ago

Library name and version

Azure.Security.KeyVault.Keys 4.5.0

Describe the bug

When interacting with Azure Key Vault we occasionally notice requests which seem to time out with the following exception:

System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
 ---> System.Net.Sockets.SocketException (125): Operation canceled
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
   at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Azure.Core.Pipeline.HttpClientTransport.ProcessAsync(HttpMessage message, Boolean async)
   at Azure.Core.Pipeline.HttpPipelineTransportPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline)
   at Azure.Core.Pipeline.RequestActivityPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async)

In Application Insights we can see the following telemetry. The issue only occurs when multiple requests are processed at the same time:

Request 1: (Application Insights screenshot)

Request 2: (Application Insights screenshot)

However, looking at the Key Vault insights, there is at no point any indication that requests fail or have significant latency, so it seems the problem is in the Key Vault client rather than being caused by Key Vault itself.

The issue relates to some extent to the behavior described in https://github.com/Azure/azure-sdk-for-net/issues/4016. However, as there is not much information provided in that issue, it's hard to verify.

Expected behavior

Deadlocks should not occur

Actual behavior

Deadlocks occur from time to time

Reproduction Steps

The original issue occurs in a more complex setup. However, we could reproduce it occasionally with the following very minimal PoC:

    using System;
    using System.Threading.Tasks;
    using Azure.Identity;
    using Azure.Security.KeyVault.Keys;
    using Microsoft.Extensions.Configuration;

    public class KeyVaultService : IKeyVaultService
    {
        private readonly IConfiguration _configuration;
        private readonly KeyClient _client;

        public KeyVaultService(IConfiguration configuration)
        {
            _configuration = configuration;

            // Placeholder credentials for the PoC; the real setup reads them from configuration.
            var tenantId = "tenant-id";
            var clientId = "client-id";
            var clientSecret = "client-secret";

            var servicePrincipal = new ClientSecretCredential(tenantId, clientId, clientSecret);

            _client = new KeyClient(new Uri("https://your-keyvault.vault.azure.net"), servicePrincipal);
        }

        public async Task<bool> CheckKey(string name)
        {
            try
            {
                // GetKeyAsync throws if the key does not exist, so any failure is caught below.
                var key = await _client.GetKeyAsync(name);
                return key != null;
            }
            catch
            {
                return false;
            }
        }
    }

It's worth noting that the KeyVaultService is injected via dependency injection so a single instance is shared among requests:

public static class ServiceCollectionExtensions
{
    public static IServiceCollection InjectEncryptionServices(this IServiceCollection services)
    {
        services.AddSingleton<IKeyVaultService, KeyVaultService>();

        return services;
    }
}

The issue could be triggered by repeatedly/simultaneously calling the following API endpoint (a small concurrency sketch follows the endpoint code). It could be observed independently of the specific Key Vault operation (e.g. also on wrap/unwrap key):

    [HttpGet("check")]
    public async Task<ActionResult<bool>> CheckKeyVaultStatus(Guid id)
    {
        return await _keyVaultService.CheckKey(id.ToString());
    }
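
To illustrate the reproduction, here is a minimal sketch of calling the endpoint concurrently; the base address and the route prefix are assumptions, and .NET 7 implicit usings are assumed:

    // Concurrency sketch (assumed base address and route; adjust to your hosting setup).
    using var http = new HttpClient { BaseAddress = new Uri("https://localhost:5001/") };

    // Fire a batch of simultaneous requests against the check endpoint.
    var calls = Enumerable.Range(0, 50)
        .Select(_ => http.GetAsync($"api/keyvault/check?id={Guid.NewGuid()}"));

    var responses = await Task.WhenAll(calls);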

Environment

Container Image: mcr.microsoft.com/dotnet/aspnet:7.0

jsquire commented 1 year ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

christothes commented 1 year ago

The 401 behavior is normal and the token seems to be acquired as expected on the first call, so it doesn't sound related to Identity. @heaths , could you take a look?

heaths commented 1 year ago

The Key Vault clients don't have any synchronization now that all the challenge-response handling was moved to Azure.Core; ChallengeBasedAuthenticationPolicy now just implements that, and all synchronization is done there. Adding @AlexanderSher and the Azure.Core label for now. I suspect the issue is there, but I can take a look later if Alexander can't.

heaths commented 1 year ago

@pascalu have you checked k8s network logs? This appears to be a timeout based on an inactive socket connection. That should not be coming from the service itself.

After some more digging, we should be past any synchronized code by this point, so a deadlock in the client itself seems unlikely.

And while it won't necessarily fix the problem, if you're making a lot of requests to Key Vault, keeping in mind its low rate limits, you might check out the sample https://learn.microsoft.com/samples/azure/azure-sdk-for-net/azure-key-vault-proxy/ and consider something similar. Of course, if you're only calling crypto operations this won't help, but if you're constantly fetching keys - or you could even just cache the JWKs in memory - this could help mitigate the issue as well.
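
To sketch the in-memory caching idea (this is not the linked sample; the Microsoft.Extensions.Caching.Memory dependency and the cache lifetime are assumptions):

    using System;
    using System.Threading.Tasks;
    using Azure.Security.KeyVault.Keys;
    using Microsoft.Extensions.Caching.Memory;

    // Sketch: cache fetched keys in memory to reduce calls against Key Vault's rate limits.
    public class CachingKeyService
    {
        private readonly KeyClient _client;
        private readonly IMemoryCache _cache;

        public CachingKeyService(KeyClient client, IMemoryCache cache)
        {
            _client = client;
            _cache = cache;
        }

        public Task<KeyVaultKey> GetKeyCachedAsync(string name) =>
            _cache.GetOrCreateAsync(name, async entry =>
            {
                // Arbitrary lifetime; pick something appropriate for your key rotation policy.
                entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(10);
                KeyVaultKey key = await _client.GetKeyAsync(name);
                return key;
            });
    }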

pascalu commented 1 year ago

Hello @heaths

Thanks for having a look at this and for your feedback. Our network logs in Azure Kubernetes Service give no indication that there are errors on outgoing connections during the period we could observe the bug. Additionally, we interact with several other Azure services and none of them seem to ever have the same or similar problems.

As a concrete next step we will configure a much lower network request timeout in the KeyClientOptions of the KeyClient. While this will not get rid of the problem in theory (the request is automatically retried after the timeout, as visible in the Application Insights telemetry), it should still mitigate the impact for now and allow us to investigate the problem further.

PS: Thanks for the hint about the Azure Key Vault Proxy. Our full solution already implements caching of keys in order to reduce the dependency on Key Vault, but it's a nice addition to look into.

heaths commented 1 year ago

I'll try to reproduce in the specified container image, but am puzzled since we use the same transport pipeline all the other SDKs use - the Azure.* SDKs, anyway; the Microsoft.Azure.* packages are mostly deprecated and an entirely different codebase.

Was there more of the stack, though? Was it the GetKeyAsync method that seemed to deadlock? I'd expect the Key Vault SDK's methods to be in the stack.

github-actions[bot] commented 1 year ago

Hi @pascalu. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

pascalu commented 1 year ago

Hi @heaths

I'm not sure if I understand your comment about Microsoft.Azure.* being deprecated - could you elaborate?

There wasn't more to the stack. It turns out the GetKeyAsync method actually completes successfully after the timeout, as the exception seems to be handled at the transport level and the request is retried automatically.

So it could be related or similar to the problem described in https://github.com/Azure/azure-sdk-for-net/issues/37477

heaths commented 1 year ago

You said you interact with different services. I was pointing out that almost all Microsoft.Azure.* SDK packages are deprecated, e.g. https://www.nuget.org/packages/Microsoft.Azure.KeyVault, so I don't recommend using a mix of those and the Azure.* SDK packages. They use different transport mechanisms.

If GetKeyAsync did complete successfully, it could be #37477 as you said.

Are you able to trace HTTP traffic as well? It would be useful to see if Key Vault is responding in time to figure out what is actually causing the SocketException. To be clear, this is the first I've seen any issue regarding it.
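
For reference, one way to capture the SDK-side HTTP traffic is Azure.Core's event source logging; a minimal sketch (the verbosity level, the content-logging switch, and the vault URI/credential are illustrative choices):

    using System;
    using System.Diagnostics.Tracing;
    using Azure.Core.Diagnostics;
    using Azure.Identity;
    using Azure.Security.KeyVault.Keys;

    // Writes Azure SDK request/response events (method, URL, status, headers) to the console.
    using var listener = AzureEventSourceListener.CreateConsoleLogger(EventLevel.Verbose);

    var options = new KeyClientOptions();
    options.Diagnostics.IsLoggingEnabled = true;        // request/response logging (on by default)
    options.Diagnostics.IsLoggingContentEnabled = true; // also log request/response bodies

    var client = new KeyClient(new Uri("https://your-keyvault.vault.azure.net"), new DefaultAzureCredential(), options);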


heaths commented 1 year ago

@pascalu in what region(s) are your vaults? Also, did this just start happening around the time you opened this issue, or has it been happening for a while?

pascalu commented 1 year ago

@heaths the Key Vaults are located in Switzerland North. It has been occurring at irregular intervals for a while, throughout the month before we submitted the issue.

We'll try to get more insight into the HTTP traffic to be able to pinpoint why the SocketException occurs.


heaths commented 1 year ago

I'll loop in the service team as well, but any diagnostics information you can provide will be helpful in determining the underlying cause.

pascalu commented 1 year ago

After not seeing the issue for a while (since the 3rd of July), we observed it again today, starting at 3 am CET, on over 200 requests throughout the day.

(screenshot)

Before observing this again, we deployed the following KeyClient options as a possible mitigation to «fail & retry faster» as described in a previous comment:

    var options = new KeyClientOptions
    {
        Retry =
        {
            MaxRetries = 3,
            MaxDelay = TimeSpan.FromSeconds(5),
            Delay = TimeSpan.FromSeconds(5),
            NetworkTimeout = TimeSpan.FromSeconds(5)
        }
    };
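
For completeness, a sketch of how these options are passed when constructing the client (the vault URI is a placeholder and `servicePrincipal` refers to the ClientSecretCredential from the PoC above):

    // The options object above is supplied to the KeyClient constructor.
    var client = new KeyClient(
        new Uri("https://your-keyvault.vault.azure.net"),
        servicePrincipal, // the ClientSecretCredential from the PoC
        options);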

The RetryOptions had the expected effect: the retry is attempted faster, and while there is still an initial failed dependency on Key Vault, we can fulfill the client's request after a retry.

(screenshot)

Therefore our conclusion would be that the problem lies on the side of the service or its surrounding infrastructure rather than in the library, which we had initially suspected (even though we have never observed traces of failed requests in Key Vault itself).

heaths commented 1 year ago

Could you provide a few affected x-ms-client-request-ids? That will help the service team look into this better.
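
In case it helps with collecting these, the client request ID for a call is available on the raw response; a minimal sketch (assuming an already constructed KeyClient named `client` and an example key name):

    // The x-ms-client-request-id sent with the call is exposed on the raw response.
    var response = await client.GetKeyAsync("example-key");
    Console.WriteLine($"x-ms-client-request-id: {response.GetRawResponse().ClientRequestId}");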

pascalu commented 1 year ago

Sure:

All are from July 18th

robersonliou commented 7 months ago

Hi @heaths, any updates here?

I encountered the same issue while trying to get a secret via the Key Vault SDK. We're using Azure.Security.KeyVault.Secrets 4.5.0; will the latest version (4.6.0) fix this problem? Any advice on how to avoid the 401 returned on the first request?

seanbamsft commented 7 months ago

@robersonliou the issue described in this thread turned out to be caused by an outdated network driver on certain physical hosts in the datacenter that corrupted segmented packets whose TCP timestamp had the most significant bit set. Connections to Key Vault VMs running on these physical hosts would fail if the client had TCP timestamps enabled and the networking stack picked a random timestamp with the most significant bit set. The version of the SDK did not matter for this issue. There are no Key Vault VMs running on hosts with the outdated driver, so this thread does not apply to any ongoing issues.

The HTTP 401 is an expected part of the Key Vault authentication protocol. The client sends a request without an authentication token and the server responds with a challenge in the WWW-Authenticate header. The client handles this by requesting an auth token and then repeating the request with the token in the Authorization header. The unauthenticated request made it to the service (with a 401 response) because that request is small enough that the packet does not get segmented. The follow-up request exceeds the segment size due to the auth token, so it triggers the network driver bug.
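
To see this challenge in action, one can send an unauthenticated request and inspect the response; a sketch (the vault URL is a placeholder and the api-version is just a recent one):

    using System;
    using System.Net.Http;

    using var http = new HttpClient();

    // No Authorization header, so Key Vault answers 401 with a WWW-Authenticate challenge.
    var response = await http.GetAsync("https://your-keyvault.vault.azure.net/keys?api-version=7.4");

    Console.WriteLine((int)response.StatusCode);                    // 401
    Console.WriteLine(response.Headers.WwwAuthenticate.ToString()); // Bearer authorization="...", resource="..."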

github-actions[bot] commented 7 months ago

Hi @pascalu. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

pascalu commented 7 months ago

Thanks :) While we mitigated this issue in our code in the meantime, it's nice to see it could be permanently resolved!