aws / aws-sdk-net

The official AWS SDK for .NET. For more information on the AWS SDK for .NET, see our web site:
http://aws.amazon.com/sdkfornet/
Apache License 2.0
2.07k stars 859 forks source link

.PauseExponentially() blocks ASP.NET 7 app startup for 30+ seconds #2546

Open lilith opened 1 year ago

lilith commented 1 year ago

Describe the bug

On a local Windows debug build, the AWSSDK pauses startup for a very long time with the following call stack:

Thread.Sleep()
EC2InstanceMetadata.PauseExponentially()
EC2InstanceMetadata.FetchApiToken()
EC2InstanceMetadata.GetItems()
EC2InstanceMetadata.GetItems()
EC2InstanceMetadata.GetData()
EC2InstanceMetadata.GetData()
EC2InstanceMetadata.get_IdentityDocument()
EC2InstanceMetadata.get_Region()
new InstanceProfileAWSRegion()
FallbackRegionFactory.<>c.<Reset>b__12_3()
FallbackRegionFactory.GetRegionEndpoint()
FallbackRegionFactory.GetRegionEndpoint()
ClientConfig.GetDefaultRegionEndpoint()
ClientConfig.get_RegionEndpoint()
ClientConfig.get_DefaultConfiguration()
ClientConfig.get_DefaultConfigurationMode()
ClientFactory.CreateConfig()
ClientFactory.CreateServiceClient()
ClientFactory.CreateServiceClient()
CallSiteRuntimeResolver.VisitFactory() [2]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSiteMain() [3]
CallSiteRuntimeResolver.VisitCache() [3]
CallSiteRuntimeResolver.VisitRootCache() [3]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSite() [3]
CallSiteRuntimeResolver.Resolve() [2]
DynamicServiceProviderEngine.<>c__DisplayClass1_0.<RealizeService>b__0() [2]
ServiceProviderEngine.GetService() [2]
ServiceProviderEngineScope.GetService() [2]
ServiceProviderServiceExtensions.GetRequiredService()
ServiceProviderServiceExtensions.GetRequiredService<Amazon.S3.IAmazonS3>()
S3ServiceExtensions.<>c__DisplayClass0_0.<AddImageflowS3Service>b__0()at C:\Users\lilith\Documents\GitHub\wimageflow-dotnet-server\src\Imageflow.Server.Storage.S3\S3ServiceExtensions.cs:line 18
CallSiteRuntimeResolver.VisitFactory() [1]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSiteMain() [2]
CallSiteRuntimeResolver.VisitCache() [2]
CallSiteRuntimeResolver.VisitRootCache() [2]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSite() [2]
CallSiteRuntimeResolver.VisitIEnumerable()
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSiteMain() [1]
CallSiteRuntimeResolver.VisitCache() [1]
CallSiteRuntimeResolver.VisitRootCache() [1]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSite() [1]
CallSiteRuntimeResolver.Resolve() [1]
DynamicServiceProviderEngine.<>c__DisplayClass1_0.<RealizeService>b__0() [1]
ServiceProviderEngine.GetService() [1]
ServiceProviderEngineScope.GetService() [1]
ActivatorUtilities.ConstructorMatcher.CreateInstance()
ActivatorUtilities.CreateInstance()
UseMiddlewareExtensions.<>c__DisplayClass4_0.<UseMiddleware>b__0()
ApplicationBuilder.Build()
GenericWebHostService.<StartAsync>d__31.MoveNext()
AsyncMethodBuilderCore.Start<Microsoft.AspNetCore.Hosting.GenericWebHostService.<StartAsync>d__31>()
AsyncTaskMethodBuilder.Start<Microsoft.AspNetCore.Hosting.GenericWebHostService.<StartAsync>d__31>()
GenericWebHostService.StartAsync()
Host.<StartAsync>d__9.MoveNext()
AsyncMethodBuilderCore.Start<Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>d__9>()
AsyncTaskMethodBuilder.Start<Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>d__9>()
Host.StartAsync()
HostingAbstractionsHostExtensions.<RunAsync>d__4.MoveNext()
AsyncMethodBuilderCore.Start<Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.<RunAsync>d__4>()
AsyncTaskMethodBuilder.Start<Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.<RunAsync>d__4>()
HostingAbstractionsHostExtensions.RunAsync()
HostingAbstractionsHostExtensions.Run()
Program.Main()

Prior to asking the DI container for the service, it is registered like so:

  services.AddAWSService<IAmazonS3>(new AWSOptions 
  {
      Credentials = new AnonymousAWSCredentials(),
      Region = RegionEndpoint.USEast1
  });

It is then accessed

      services.AddSingleton<IBlobProvider>((container) =>
      {
          var logger = container.GetRequiredService<ILogger<S3Service>>();
          var s3 = container.GetRequiredService<IAmazonS3>(); // This line takes 30 seconds.
          return new S3Service(options, s3, logger);
      });

Expected Behavior

I would expect that the SDK would not pause exponentially on my developer machine (or in production)

Current Behavior

A 30 second delay

Reproduction Steps

See above

Possible Solution

No response

Additional Information/Context

info: AWSSDK[0]
      Using AWS credentials specified with the AWSOptions.Credentials property

It appears that this mess is caused by AWS trying to get the region endpoint, and failing, despite it being explicitly configured in code.

Here's another stack trace:


WaitHandle.WaitOneNoCheck()at C:\Users\lilith\AppData\Roaming\JetBrains\Rider2022.3\resharper-host\SourcesCache\916c3b86afbb53acf2ca9da9c4d356e4e2aa5a094e79e9777baad46806363\WaitHandle.cs:line 161
WaitHandle.WaitOne()at C:\Users\lilith\AppData\Roaming\JetBrains\Rider2022.3\resharper-host\SourcesCache\916c3b86afbb53acf2ca9da9c4d356e4e2aa5a094e79e9777baad46806363\WaitHandle.cs:line 406
AsyncHelpers.ExclusiveSynchronizationContext.BeginMessageLoop()
AsyncHelpers.RunSync<System.Net.Http.HttpResponseMessage>()
AWSSDKUtils.ExecuteHttpRequest()
EC2InstanceMetadata.FetchApiToken()
EC2InstanceMetadata.GetItems()
EC2InstanceMetadata.GetItems()
EC2InstanceMetadata.GetData()
EC2InstanceMetadata.GetData()
EC2InstanceMetadata.get_IdentityDocument()
EC2InstanceMetadata.get_Region()
new InstanceProfileAWSRegion()
FallbackRegionFactory.<>c.<Reset>b__12_3()
FallbackRegionFactory.GetRegionEndpoint()
FallbackRegionFactory.GetRegionEndpoint()
ClientConfig.GetDefaultRegionEndpoint()
ClientConfig.get_RegionEndpoint()
ClientConfig.get_DefaultConfiguration()
ClientConfig.get_DefaultConfigurationMode()
ClientFactory.CreateConfig()
ClientFactory.CreateServiceClient()
ClientFactory.CreateServiceClient()
CallSiteRuntimeResolver.VisitFactory() [2]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSiteMain() [3]
CallSiteRuntimeResolver.VisitCache() [3]
CallSiteRuntimeResolver.VisitRootCache() [3]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSite() [3]
CallSiteRuntimeResolver.Resolve() [2]
DynamicServiceProviderEngine.<>c__DisplayClass1_0.<RealizeService>b__0() [2]
ServiceProviderEngine.GetService() [2]
ServiceProviderEngineScope.GetService() [2]
ServiceProviderServiceExtensions.GetRequiredService()
ServiceProviderServiceExtensions.GetRequiredService<Amazon.S3.IAmazonS3>()
S3ServiceExtensions.<>c__DisplayClass0_0.<AddImageflowS3Service>b__0()
CallSiteRuntimeResolver.VisitFactory() [1]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSiteMain() [2]
CallSiteRuntimeResolver.VisitCache() [2]
CallSiteRuntimeResolver.VisitRootCache() [2]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSite() [2]
CallSiteRuntimeResolver.VisitIEnumerable()
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSiteMain() [1]
CallSiteRuntimeResolver.VisitCache() [1]
CallSiteRuntimeResolver.VisitRootCache() [1]
CallSiteVisitor<RuntimeResolverContext, object>.VisitCallSite() [1]
CallSiteRuntimeResolver.Resolve() [1]
DynamicServiceProviderEngine.<>c__DisplayClass1_0.<RealizeService>b__0() [1]
ServiceProviderEngine.GetService() [1]
ServiceProviderEngineScope.GetService() [1]
ActivatorUtilities.ConstructorMatcher.CreateInstance()
ActivatorUtilities.CreateInstance()
UseMiddlewareExtensions.<>c__DisplayClass4_0.<UseMiddleware>b__0()
ApplicationBuilder.Build()
GenericWebHostService.<StartAsync>d__31.MoveNext()
AsyncMethodBuilderCore.Start<Microsoft.AspNetCore.Hosting.GenericWebHostService.<StartAsync>d__31>()
AsyncTaskMethodBuilder.Start<Microsoft.AspNetCore.Hosting.GenericWebHostService.<StartAsync>d__31>()
GenericWebHostService.StartAsync()
Host.<StartAsync>d__9.MoveNext()
AsyncMethodBuilderCore.Start<Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>d__9>()
AsyncTaskMethodBuilder.Start<Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>d__9>()
Host.StartAsync()
HostingAbstractionsHostExtensions.<RunAsync>d__4.MoveNext()
AsyncMethodBuilderCore.Start<Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.<RunAsync>d__4>()
AsyncTaskMethodBuilder.Start<Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.<RunAsync>d__4>()
HostingAbstractionsHostExtensions.RunAsync()
HostingAbstractionsHostExtensions.Run()
Program.Main()

AWS .NET SDK and/or Package version used

AWSSDK.S3 3.7.103.15 and AWSSDK.Extensions.NETCore.Setup 3.7.4

Targeted .NET Platform

.NET 7

Operating System and version

Windows 11

lilith commented 1 year ago

Here's a repro project: https://github.com/lilith/aws-pauses-exponentially

All this does is (a) AddAWSService, (b) .GetRequiredService(); inside AddSingleton, and (c) .UseMiddleware to trigger requisition of said singleton. It may be possible to further simplify, but since this is only a few lines, It's probably not helpful.

ashishdhingra commented 1 year ago

Hi @lilith,

Good afternoon.

Thanks for reporting the issue. Upon further analysis of SDK code, EC2InstanceMetadata.PauseExponentially() is the exponential retry option in cases where credentials could not be fetched from EC2 instance metadata service. I do see in your code that you are using Credentials = new AnonymousAWSCredentials() while configuring default AWS options. When you run your code locally, credentials are resolved per resolution order mentioned in Credential and profile resolution. In case SDK is unable to find credentials in the order, it reaches the last generator, which is EC2 instance metadata. Since your local machine might not be an EC2 instance, it ultimately fails, while trying exponential retry backoff. For credential resolution via EC2 instance metadata, the exponential backoff is determined by the following constants defined in the code:

DEFAULT_RETRIES = 3,
MIN_PAUSE_MS = 250,
MAX_RETRIES = 3,
DEFAULT_APITOKEN_TTL = 21600;

So in total the credential resolution in your case might be taking 30 seconds.

Could you please advise how credentials are configured for your application?

Thanks, Ashish

lilith commented 1 year ago

I don't want to use credentials. I want to make anonymous requests to public buckets and blobs. How, other than using AnonymousAWSCredentials, can I accomplish this? I also do not want to have 30 seconds of downtime each deploy.

Also, this is library code, so I cannot require my end users to specify environment variables or files.

ashishdhingra commented 1 year ago

EC2InstanceMetadata.get_Region()

@lilith My apologies, I think I misread the call stack. Looks like it is trying to resolve the region from EC2InstanceMetadata based on the stack:

Thread.Sleep()
EC2InstanceMetadata.PauseExponentially()
EC2InstanceMetadata.FetchApiToken()
EC2InstanceMetadata.GetItems()
EC2InstanceMetadata.GetItems()
EC2InstanceMetadata.GetData()
EC2InstanceMetadata.GetData()
EC2InstanceMetadata.get_IdentityDocument()
EC2InstanceMetadata.get_Region()
new InstanceProfileAWSRegion()
FallbackRegionFactory.<>c.<Reset>b__12_3()
FallbackRegionFactory.GetRegionEndpoint()
FallbackRegionFactory.GetRegionEndpoint()
ClientConfig.GetDefaultRegionEndpoint()
ClientConfig.get_RegionEndpoint()
ClientConfig.get_DefaultConfiguration()
ClientConfig.get_DefaultConfigurationMode()
ClientFactory.CreateConfig()
ClientFactory.CreateServiceClient()

I guess it is using the anonymous credentials per your scenario to access public buckets and is working fine.

Based on the call stack, EC2 instance metadata class:

On you local development, this process would fail since it would retry 3 times, pausing exponentially in between.

Please advise how is region configured in your application? In order to avoid region resolution, the possible workaround would be to configure region for S3 client based on region resolution order.

Thanks, Ashish

lilith commented 1 year ago

I created a repository and whittled a reproducible project down for you as mentioned earlier:

https://github.com/lilith/aws-pauses-exponentially/blob/main/Startup.cs#L50-L64

services.AddAWSService<IAmazonS3>(new AWSOptions 
            {
                Credentials = new AnonymousAWSCredentials(),
                Region = RegionEndpoint.USEast1
            });

Region is being set in AWSOptions.

As an aside, I never personally use EC2 and would love to turn off all that nonsense for my personal apps. For the purposes of the Nuget libraries, I need a way to provide users with S3 integration without killing their uptime.

ashishdhingra commented 1 year ago

I created a repository and whittled a reproducible project down for you as mentioned earlier:

https://github.com/lilith/aws-pauses-exponentially/blob/main/Startup.cs#L50-L64

services.AddAWSService<IAmazonS3>(new AWSOptions 
            {
                Credentials = new AnonymousAWSCredentials(),
                Region = RegionEndpoint.USEast1
            });

Region is being set in AWSOptions.

As an aside, I never personally use EC2 and would love to turn off all that nonsense for my personal apps. For the purposes of the Nuget libraries, I need a way to provide users with S3 integration without killing their uptime.

@lilith Thanks for your response. I will investigate to determine what is happening and provide you updates accordingly.

lilith commented 1 year ago

Also, I think it's really not a good choice to have network calls occur synchronously during application startup, for (a) speed, (b) reliability, and (c) since this behavior is completely hidden from users. That kind of dangerous behavior should require explicit opt-in. Fail fast by default, please.

normj commented 1 year ago

You can set the AWS_EC2_METADATA_DISABLED environment variable to false to disable the EC2 instance metadata lookup.

lilith commented 1 year ago

Yes, but I'm shipping libraries not applications.

lilith commented 1 year ago

And when AWSSDK.dll causes issues in my libs, they get filed on my repos, not yours.

lilith commented 1 year ago

Ideally, I'd like there to be a way - via C# - to make the sdk well-behaved.

normj commented 1 year ago

Interesting adding an S3 client with Anonymous credentials to the DI as a library. Hope your use cases won't interfere with other cases an application might be relying on an IAmazonS3 in the DI with credentials.

Thanks for the repo code. I stepped through the code and it looks like the SDK is getting tripped up determining the DefaultConfigurationMode which goes to EC2 instance metadata. We should figure out how to prevent that from happening in use cases like yours.

For a work around if you explicitly set the DefaultConfigurationMode like below that fixed the slowness for me.

services.AddAWSService<IAmazonS3>(new AWSOptions 
{
    Credentials = new AnonymousAWSCredentials(),
    Region = RegionEndpoint.USEast1,
    DefaultConfigurationMode = DefaultConfigurationMode.Standard
});
lilith commented 1 year ago

No, in this instance the library consumer is responsible for calling .AddAWSService(), but there are cases where an anonymous client is needed in a library (direct, not DI) and those cases had a similar issue.

Thank you for the workaround, I'll push this change to the many example projects and docs that use .AddAWSService()

lilith commented 1 year ago

I noticed the docs mention that behavior of DefaultConfigurationMode.Standard may change over time. Is there an option that can explicitly opt-out of high-latency behavior permanently?

normj commented 1 year ago

Not sure what you mean by high-latency behavior in this case. Are you talking about the bug that is triggering the slowness creating service clients that you reported? Under the covers DefaultConfigurationMode is always resolved, but in your case there is a bug that it should not trigger an EC2 metadata lookup. So the only way to avoid that is skip the auto resolution and explicit set it.

I should have been more clear and mentioned the actual default used by the SDK is Legacy if no environment factors, like environment variables, cause us to detect a different mode. Legacy won't change as long we are on this major version of the SDK. But the settings for Standard are more modern especially for retry behavior.

lilith commented 1 year ago

For example, if I wish to ensure that configuration will not be acquired over the network - that it will come from env vars or other locally, instantly accessible data stores, what would I set? What if I only want to all configuration to be loaded via code?

normj commented 1 year ago

The actual configuration is not acquired over the network, the SDK looks at the environment to figure out which of the modes to use. For S3 you can see here where the configurations are defined. https://github.com/aws/aws-sdk-net/blob/82ec756de374a6a2d0dce55df00b537a6929a577/sdk/src/Services/S3/Generated/AmazonS3DefaultConfiguration.cs#L54

ashishdhingra commented 1 year ago

@lilith Please advise if the inputs provided by @normj provides necessary guidance/workaround, and if we are good to close the issue.

lilith commented 1 year ago

As I understand it, there is an open bug here. I don't understand why this would be closed.

On Tue, Feb 28, 2023, 11:10 AM Ashish Dhingra @.***> wrote:

@lilith https://github.com/lilith Please advise if the inputs provided by @normj https://github.com/normj provides necessary guidance/workaround, and if we are good to close the issue.

— Reply to this email directly, view it on GitHub https://github.com/aws/aws-sdk-net/issues/2546#issuecomment-1448643401, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2LHY7JMWYHKLXJ4UVIZTWZY5QLANCNFSM6AAAAAAVGOUHXM . You are receiving this because you were mentioned.Message ID: @.***>

ashishdhingra commented 1 year ago

Just for reference, when the region is explicitly set, in default configuration mode it is still probing for region, ultimately failing at EC2 instance metadata. In this case, implicit default configuration mode should not do region lookup. In this case, user is not on an EC2 instance.