aws / aws-sdk-net

The official AWS SDK for .NET. For more information on the AWS SDK for .NET, see our web site:
http://aws.amazon.com/sdkfornet/
Apache License 2.0
2.04k stars 853 forks source link

Deadlock acquiring credentials when GlobalRuntimeDependencyRegistry used #3153

Open martincostello opened 7 months ago

martincostello commented 7 months ago

Describe the bug

While doing research into publishing a .NET 8 application using native AoT to EKS, I found at deploy-time that the application would fail on a code path we have in an internal library where the AWS SDK would attempt to get credentials using STS. The exception would be the following:

Unhandled Exception: Amazon.RuntimeDependencies.MissingRuntimeDependencyException: Operation failed because of a missing runtime dependency. In Native AOT builds runtime dependencies can not be dynamically loaded from assembles. Instead the runtime dependency needs to be explicitly registered. To complete this operation register an instance of Amazon.SecurityToken.AmazonSecurityTokenServiceClient from package AWSSDK.SecurityToken using the operation Amazon.RuntimeDependencies.GlobalRuntimeDependencyRegistry.Instance.RegisterSecurityTokenServiceClient.

I added the code mentioned to the application and redeployed. At this point it appeared to no longer be responding and crash-looping.

Through various trial and error, I've distilled the problematic code path down to the code snippet below.

My hunch is that the code path is somehow re-entrantly trying to access this lock, causing it to block indefinitely:

https://github.com/aws/aws-sdk-net/blob/f7a3a70a75ff1c4d5bc877b10a03e836cfcbe32a/sdk/src/Core/Amazon.Runtime/Credentials/RefreshingAWSCredentials.cs#L144-L146

That's just a hunch - the issue could be somewhere completely different, but it's the first bit of synchronization I could find on the method call that is blocking. It also blocks using the synchronous version, GetCredentials(), so I'm pretty sure it's not something to do with sync-over-async.

The code appears to deadlock regardless of whether native AoT is actually used at runtime or not.

Expected Behavior

The operation does not block.

With the repro, the following messages should be printed to the console:

GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()
AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()
AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()
Credentials obtained

Current Behavior

The application blocks on the call to AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync().

GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()
AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()
AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()

Reproduction Steps

Run the following code within EKS where the instance credentials are available.

using Amazon.Runtime;
using Amazon.RuntimeDependencies;
using Amazon.SecurityToken;

Console.WriteLine("GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()");

GlobalRuntimeDependencyRegistry.Instance.RegisterSecurityTokenServiceClient(
    (_) =>
        new AmazonSecurityTokenServiceClient());

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()");

var credentials = AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables();

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()");

_ = await credentials.GetCredentialsAsync();

Console.WriteLine("Credentials obtained");

Possible Solution

No response

Additional Information/Context

No response

AWS .NET SDK and/or Package version used

AWSSDK.SecurityToken 3.7.300.38

Targeted .NET Platform

.NET 8

Operating System and version

Ubuntu 22.04.3 LTS

martincostello commented 7 months ago

Doing a bit more digging, this might be a case of more of a docs/footgun issue than a bug per-se.

Digging around some more, I see here that when not using native AoT, the client is created with explicit anonymous credentials:

https://github.com/aws/aws-sdk-net/blob/f7a3a70a75ff1c4d5bc877b10a03e836cfcbe32a/sdk/src/Core/Amazon.Runtime/Credentials/AssumeRoleWithWebIdentityCredentials.cs#L240

However, in my repro, I'm just calling the constructor with a region, it calls back into FallbackCredentialsFactory, which is what I traced through in a decompiler to get to the minimal repro in the first place:

https://github.com/aws/aws-sdk-net/blob/f7a3a70a75ff1c4d5bc877b10a03e836cfcbe32a/sdk/src/Services/SecurityToken/Generated/_netstandard/AmazonSecurityTokenServiceClient.cs#L90

I haven't tried changing my code yet to do the same, but it raises a few points

1) From the original error message, it isn't clear that the user needs to do this to prevent a loop when default/fallback credentials are used. It seems like an implementation detail the average user wouldn't know of or find. 2) Without me digging more in a debugger, if this is the case, I would have thought it would have caused a StackOverflowException rather than hang (unless the runtime is just failing to show that, so it appears to be hanging).

If this does turn out to be what is happening, then I guess the bug is less a deadlock, and more a need for clearer documentation on what the user needs to do in this case.

martincostello commented 7 months ago

Yep, that's what it was.

If I change the code to this, the problem goes away:

using Amazon.Runtime;
using Amazon.RuntimeDependencies;
using Amazon.SecurityToken;

Console.WriteLine("GlobalRuntimeDependencyRegistry.RegisterSecurityTokenServiceClient()");

GlobalRuntimeDependencyRegistry.Instance.RegisterSecurityTokenServiceClient(
-   (_) =>
-       new AmazonSecurityTokenServiceClient());
+   (context) =>
+       new AmazonSecurityTokenServiceClient(
+           new AnonymousAWSCredentials(),
+           context.SecurityTokenServiceClientContextData.Region));

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables()");

var credentials = AssumeRoleWithWebIdentityCredentials.FromEnvironmentVariables();

Console.WriteLine("AssumeRoleWithWebIdentityCredentials.GetCredentialsAsync()");

_ = await credentials.GetCredentialsAsync();

Console.WriteLine("Credentials obtained");

Maybe it would help give a hint for the user in this case if SecurityTokenServiceClientContext included a property for the credentials to use, which could be set to an instance of AnonymousAWSCredentials?

https://github.com/aws/aws-sdk-net/blob/f7a3a70a75ff1c4d5bc877b10a03e836cfcbe32a/sdk/src/Core/RuntimeDependencies/CreateInstanceContext.cs#L163

public class SecurityTokenServiceClientContext
{
    public enum ActionContext { AssumeRoleAWSCredentials, AssumeRoleWithWebIdentityCredentials, FederatedAWSCredentials };
    public ActionContext Action { get; set; }
    public RegionEndpoint Region { get; set; }
    public IWebProxy ProxySettings { get; set; }
+   public AWSCredentials Credentials { get; set; } = new AnonymousAWSCredentials();
}
normj commented 7 months ago

The STS dependency is the hardest one to deal with in the SDK for AOT or more specifically now the GlobalRuntimeDependencyRegistry. Even in the assume role case, which requires AWS credentials and not the anonymous credentials, you want to make sure your not just using the default credentials because that would go back to the profile that needs to assume a role instead of the profile that should make the STS assume role call.

I agree this area needs some smoothing out and at the very least better documentation.

beeradmoore commented 6 months ago

If you are using code that gets the role from the running instance which has role assumed in its execution context like Fargate or Lambda (which I believe is using assume role?), is the current correct action to keep <Assembly Name="AWSSDK.SecurityToken" Dynamic="Required All"></Assembly> in the rd.xml file? Or is there another way to use AssumeRoleAWSCredentials without hardcoding the role name itself?

Side note, it's crazy to me that this GitHub issue is the only reference found on Google for RegisterSecurityTokenServiceClient.