pascalu closed this issue 7 months ago
Thank you for your feedback. Tagging and routing to the team member best able to assist.
The 401 behavior is normal and the token seems to be acquired as expected on the first call, so it doesn't sound related to Identity. @heaths , could you take a look?
The Key Vault clients don't have any synchronization now that all the challenge-response handling was moved to Azure.Core; ChallengeBasedAuthenticationPolicy just implements that protocol, and all synchronization is done there. Adding @AlexanderSher and the Azure.Core label for now. I suspect the issue is there, but I can take a look later if Alexander can't.
@pascalu have you checked k8s network logs? This appears to be a timeout based on an inactive socket connection. That should not be coming from the service itself.
After some more digging, we should be past any synchronized code by this point, so a deadlock in the client itself seems unlikely.
And while it won't necessarily fix the problem, if you're making a lot of requests to Key Vault, keeping in mind its low rate limits, you might check out the sample https://learn.microsoft.com/samples/azure/azure-sdk-for-net/azure-key-vault-proxy/ and consider something similar. Of course, if you're only calling crypto operations this won't help, but if you're constantly fetching keys - or you could even just cache the JWKs in memory - this could help mitigate the issue as well.
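For illustration, a minimal sketch of the in-memory caching idea, assuming Microsoft.Extensions.Caching.Memory; the CachedKeyProvider wrapper and the ten-minute TTL are illustrative choices, not part of the SDK:

using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Security.KeyVault.Keys;
using Microsoft.Extensions.Caching.Memory;

public class CachedKeyProvider
{
    private readonly KeyClient _client;
    private readonly MemoryCache _cache = new MemoryCache(new MemoryCacheOptions());

    public CachedKeyProvider(Uri vaultUri) =>
        _client = new KeyClient(vaultUri, new DefaultAzureCredential());

    // Returns the cached key when available; otherwise fetches it and caches it,
    // cutting down repeated calls against Key Vault's rate limits.
    public Task<KeyVaultKey> GetKeyAsync(string name) =>
        _cache.GetOrCreateAsync(name, async entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(10); // arbitrary TTL
            return (await _client.GetKeyAsync(name)).Value;
        });
}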
Hello @heaths
Thanks for having a look at this and for your feedback. Our network logs in Azure Kubernetes Service give no indication of errors on outgoing connections during the period in which we observed the bug. Additionally, we interact with several other Azure services and none of them ever seem to have the same or similar problems.
As a concrete next action we will configure a much lower network request timeout in the KeyClientOptions of the KeyClient. While this will not fix the underlying problem, the request is automatically retried after the timeout (as visible in the Application Insights telemetry), so a shorter timeout should mitigate the impact for now and allow us to investigate further.
PS: Thanks for the hint about the Azure Key Vault proxy. Our full solution already caches keys to reduce the dependency on Key Vault, but it's a nice approach to look into.
I'll try to reproduce in the specified container image, but am puzzled since we use the same transport pipeline all the other SDKs use - the Azure.* SDKs, anyway; the Microsoft.Azure.* packages are mostly deprecated and an entirely different codebase.
Was there more of the stack, though? Was it the GetKeyAsync method that seemed to deadlock? I'd expect the Key Vault SDK's methods to be in the stack.
Hi @pascalu. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
Hi @heaths
I'm not sure if I understand your comment about Microsoft.Azure.* being deprecated - could you elaborate?
There wasn't more to the stack. It turns out the GetKeyAsync method actually completes successfully after the timeout, as the exception seems to be handled at the transport level and the request is retried automatically:
So it could possibly be related or similar to the problem described in https://github.com/Azure/azure-sdk-for-net/issues/37477
You said you interact with different services. I was pointing out that almost all Microsoft.Azure.* SDK packages are deprecated (e.g., https://www.nuget.org/packages/Microsoft.Azure.KeyVault), so I don't recommend using a mix of those and the Azure.* SDK packages. They use different transport mechanisms.
If GetKeyAsync did complete successfully, it could be #37477 as you said.
Are you able to trace HTTP traffic as well? It would be useful to see if Key Vault is responding in time, to figure out what is actually causing the SocketException. To be clear, this is the first I've seen any issue regarding it.
@pascalu in what region(s) are your vaults? Also, did this just start happening around the time you opened this issue, or has it been happening for a while?
@heaths the Key Vaults are located in Switzerland North. It had been occurring at irregular intervals during the month before we submitted the issue.
We'll try to get more insight into the HTTP traffic to be able to pinpoint why the SocketException occurs.
I'll loop in the service team as well, but any diagnostics information you can provide will be helpful in determining the underlying cause.
After not seeing the issue for a while (since the 3rd of July), we observed it again today, starting at 3 AM CET, on over 200 requests throughout the day.
Before observing this again, we had deployed the following KeyClientOptions as a possible mitigation to «fail & retry faster», as described in a previous comment:
var options = new KeyClientOptions
{
    Retry =
    {
        MaxRetries = 3,                            // up to three retries per operation
        MaxDelay = TimeSpan.FromSeconds(5),        // cap on the back-off delay
        Delay = TimeSpan.FromSeconds(5),           // initial delay between retries
        NetworkTimeout = TimeSpan.FromSeconds(5)   // give up on a hung request after 5 seconds
    }
};
The RetryOptions had the expected effect: the retry is attempted faster, and while the first call to Key Vault still shows up as a failed dependency, we can fulfill the client's request after a retry.
Therefore our conclusion would be that the problem lies with the service or its surrounding infrastructure rather than in the library, as we initially suspected (even though we have never observed traces of failed requests in Key Vault itself).
Could you provide a few affected x-ms-client-request-ids? That will help the service team look into this better.
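One way to capture them, as a sketch only: Azure.Core writes its request and response events, including the client request IDs, to an event source, so a console listener will surface them:

using System.Diagnostics.Tracing;
using Azure.Core.Diagnostics;

// Prints Azure SDK request/response events, including x-ms-client-request-id values, to the console.
using AzureEventSourceListener listener =
    AzureEventSourceListener.CreateConsoleLogger(EventLevel.Informational);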
Sure:
All are from July 18th
Hi @heaths , any updates here?
I encountered the same issue while trying to get a secret via the Key Vault SDK. We're using Azure.Security.KeyVault.Secrets 4.5.0; will the latest version (4.6.0) fix this problem? Any advice on how to avoid the first request returning 401?
@robersonliou the issue described in this thread turned out to be an outdated network driver on certain physical hosts in the datacenter that corrupted segmented packets carrying a TCP timestamp with the most significant bit set. Connections to Key Vault VMs running on those physical hosts would fail if the client had TCP timestamps enabled and the networking stack picked a random timestamp with the most significant bit set. The version of the SDK did not matter for this issue. There are no Key Vault VMs running on hosts with the outdated driver, so this thread does not apply to any ongoing issues.
The HTTP 401 is an expected part of the Key Vault authentication protocol. The client sends a request without an authentication token and the server responds with a challenge in the WWW-Authenticate header. The client handles this by requesting an auth token and then repeating the request with the token in the Authorization header. The unauthenticated request made it to the service (with a 401 response) because that request is small enough that the packet does not get segmented. The follow-up request exceeds the segment size due to the auth token, so it triggers the network driver bug.
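For readers following along, a rough sketch of that exchange - roughly what the SDK's challenge handling does for you; the vault URI and key name are placeholders and this is illustrative only, not code you need to write:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

class ChallengeFlowSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var keyUri = new Uri("https://<your-vault>.vault.azure.net/keys/<key-name>?api-version=7.4");

        // 1. The first request carries no token; Key Vault answers 401 with a challenge
        //    in the WWW-Authenticate header.
        var challengeResponse = await http.GetAsync(keyUri);
        Console.WriteLine($"{(int)challengeResponse.StatusCode} {challengeResponse.Headers.WwwAuthenticate}");

        // 2. The client acquires a token for the scope named in the challenge ...
        var credential = new DefaultAzureCredential();
        AccessToken token = await credential.GetTokenAsync(
            new TokenRequestContext(new[] { "https://vault.azure.net/.default" }));

        // 3. ... and repeats the request with the Authorization header. This second,
        //    larger request is the one that exceeded the segment size in the incident above.
        var request = new HttpRequestMessage(HttpMethod.Get, keyUri);
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token.Token);
        var authenticated = await http.SendAsync(request);
        Console.WriteLine((int)authenticated.StatusCode);
    }
}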
Hi @pascalu. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.
Thanks :) While we mitigated this issue in our code in the meantime, it's nice to see it could be permanently resolved!
Library name and version
Azure.Security.KeyVault.Keys 4.5.0
Describe the bug
When interacting with Azure Key Vault we occasionally notice requests which seem to time out with the following exception:
In Application Insights we can see the following telemetry. The issue only occurs when multiple requests are processed at the same time:
Request 1
Request 2
To summarize the above:
However, the insights of Key Vault at no point indicate that requests fail or have significant latency, so it seems the problem is in the Key Vault client and not caused by Key Vault itself.
The issue is related to some extent to the behavior described in https://github.com/Azure/azure-sdk-for-net/issues/4016. However, as there is not much information provided in that issue, it's hard to verify.
Expected behavior
Deadlocks should not occur
Actual behavior
Deadlocks occur from time to time
Reproduction Steps
The original issue is occurring in a more complex setup. However, we could reproduce it occasionally with this very minimal PoC.
However:
It's worth noting that the KeyVaultService is injected via dependency injection, so a single instance is shared among requests:
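A minimal sketch of what such a registration could look like, assuming .NET 7 minimal hosting with implicit usings; KeyVaultService and the vault URI below are illustrative stand-ins, not the original code:

using Azure.Identity;
using Azure.Security.KeyVault.Keys;

var builder = WebApplication.CreateBuilder(args);

// Both the low-level client and the wrapping service are registered as singletons,
// so every incoming request shares the same KeyClient instance.
builder.Services.AddSingleton(_ =>
    new KeyClient(new Uri("https://<your-vault>.vault.azure.net/"), new DefaultAzureCredential()));
builder.Services.AddSingleton<KeyVaultService>();

var app = builder.Build();
app.Run();

// Illustrative stand-in for the service described above.
public class KeyVaultService
{
    private readonly KeyClient _client;
    public KeyVaultService(KeyClient client) => _client = client;

    public async Task<KeyVaultKey> GetKeyAsync(string name) => await _client.GetKeyAsync(name);
}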
The issue could be triggered by repeatedly/simultaneously calling the following API endpoint. It could be observed independently of the specific Key Vault operation (e.g. also on wrap/unwrap key):
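An endpoint of that kind, continuing the hosting sketch above (the mapping goes before app.Run(); the route is a placeholder, and GetKeyAsync stands in for any Key Vault call):

// Calling this route repeatedly and concurrently exercises the shared KeyClient instance.
app.MapGet("/keys/{name}", async (string name, KeyVaultService service) =>
{
    KeyVaultKey key = await service.GetKeyAsync(name);
    return Results.Ok(new { Name = key.Name, KeyType = key.KeyType.ToString() });
});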
Environment
Container Image: mcr.microsoft.com/dotnet/aspnet:7.0