aws / aws-sdk-java-v2

The official AWS SDK for Java - Version 2
Apache License 2.0

InstanceProfileCredentialsProvider unable to refresh/recover from throttling or network problems at time of credential cache refresh #5247

Open steveloughran opened 5 months ago

steveloughran commented 5 months ago

Describe the bug

The full details and analysis are in HADOOP-19181, "IAMCredentialsProvider throttle failures".

It's notable that ContainerCredentialsProvider has more resilience in

Expected Behavior

IAMCredentialsProvider should always issue valid credentials. This should include asynchronous refreshing of credentials when enabled, far enough in advance of their expiry that recovery attempts can be repeated.

Current Behavior

Invalid credentials were returned when requesting credentials to sign a request.

Reproduction Steps

See the Hadoop bug report. The deployment was a single EC2 node running many services, long enough for the EC2 credentials to expire and need refreshing.

Possible Solution

I really don't see what we can do in the Hadoop codebase to recover from this. We could consider extending our own IAMInstanceCredentialsProvider to wrap the retry failures with our own sleep/retry, but as it'd take > 10 seconds, API calls awaiting signing (e.g. S3 Express CreateSession) will time out.
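To make the sleep/retry idea concrete, here is a minimal sketch of such a wrapper using only the JDK. `RetryingFetcher`, `Sleeper` and the constructor parameters are hypothetical names made up for illustration; they are not part of the AWS SDK or of Hadoop. The delegate stands in for whatever call actually fetches the credentials. It does not address the timeout concern above: the total backoff still has to fit inside the caller's signing deadline.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical helper: retries a failing fetch with exponential backoff. */
final class RetryingFetcher<T> implements Supplier<T> {

    /** Injectable sleep, so tests do not have to wait in real time. */
    interface Sleeper {
        void sleep(Duration d) throws InterruptedException;
    }

    private final Supplier<T> delegate;
    private final int maxAttempts;
    private final Duration initialDelay;
    private final Sleeper sleeper;

    RetryingFetcher(Supplier<T> delegate, int maxAttempts,
                    Duration initialDelay, Sleeper sleeper) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
        this.initialDelay = initialDelay;
        this.sleeper = sleeper;
    }

    @Override
    public T get() {
        Duration delay = initialDelay;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return delegate.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts: rethrow the last failure below
                }
                try {
                    sleeper.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", ie);
                }
                delay = delay.multipliedBy(2); // double the wait before the next attempt
            }
        }
        throw last;
    }
}
```

In production the sleeper could be `d -> Thread.sleep(d.toMillis())`. With 4 attempts and a 1 second initial delay the worst case waits 1 + 2 + 4 = 7 seconds before giving up, which is already close to the 10 second figure mentioned above.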

I'd propose copying the approach of ContainerCredentialsProvider: if not the retries, then at least an expiry time set many minutes ahead of the actual credential expiry.
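As an illustration of the "expire early" idea, the sketch below refreshes a cached value once it is within a configurable window of its real expiry, and keeps serving the old value if that refresh fails while the value is still genuinely valid. This is not the SDK's implementation; `EarlyRefreshCache`, `Entry` and `refreshAhead` are hypothetical names, and the refresh here is synchronous rather than on a background thread.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical cache that starts refreshing well before the real expiry. */
final class EarlyRefreshCache<T> {

    /** A loaded value together with its real expiry time. */
    static final class Entry<V> {
        final V value;
        final Instant expiry;

        Entry(V value, Instant expiry) {
            this.value = value;
            this.expiry = expiry;
        }
    }

    private final Supplier<Entry<T>> loader;
    private final Duration refreshAhead;
    private final Supplier<Instant> now; // injectable clock for testing
    private Entry<T> cached;

    EarlyRefreshCache(Supplier<Entry<T>> loader, Duration refreshAhead,
                      Supplier<Instant> now) {
        this.loader = loader;
        this.refreshAhead = refreshAhead;
        this.now = now;
    }

    synchronized T get() {
        Instant t = now.get();
        boolean refreshDue = cached == null
            || !t.isBefore(cached.expiry.minus(refreshAhead));
        if (refreshDue) {
            try {
                cached = loader.get();
            } catch (RuntimeException e) {
                // Refresh failed. Only surface the error if there is nothing
                // valid left to serve; otherwise retry on the next call.
                if (cached == null || !t.isBefore(cached.expiry)) {
                    throw e;
                }
            }
        }
        return cached.value;
    }
}
```

With a one hour credential lifetime and `refreshAhead` of 15 minutes, a throttled refresh at minute 45 still leaves 15 minutes of repeated attempts before callers see a failure.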

Additional Information/Context

No response

AWS Java SDK version used

2.24.6

JDK version used

An OpenJDK Java 8 build

Operating System and version

Linux

debora-ito commented 4 months ago

Hi @steveloughran

We added a task to analyze and improve the credential experience. Thank you for the detailed bug report - HADOOP-19181.

steveloughran commented 4 months ago

Is there an approximate timeline for this? I don't want to have to reimplement our own refresh thread code, not because it is hard, but because maintenance and testing with fault injection are the pain points.

steveloughran commented 3 months ago

Any timeline updates?

I am about to have to implement a workaround for what should be a foundational use case: "reliably provide credentials to Java applications running in EC2 instances". The fact that this works for k8s containers but not VMs implies that the test setup isn't stressing this, and it has become the task of downstream projects to (a) discover the problem and (b) come up with workarounds.

steveloughran commented 3 weeks ago

New stack trace. Single host launching 5 processes, triggering a 503 response. This is not during cache refresh, this is simply startup.

The good news: this is easier for you to replicate in a test before you fix.

Comment from the engineer:

Another data point is that while the failure is clearly seen in the error log, the linklocal_allowance_exceeded metric remained 0 (zero) through the whole build. So either the metric is broken or, if it tells the truth, the failure is not in the local link layer but in the service behind the link.
24/10/01 17:48:04 WARN internal.InstanceMetadataServiceResourceFetcher: Fail to retrieve token 
com.amazonaws.AmazonServiceException: Service Unavailable (Service: null; Status Code: 503; Error Code: null; Request ID: null; Proxy: null)
    at com.amazonaws.internal.EC2ResourceFetcher.handleErrorResponse(EC2ResourceFetcher.java:161)
    at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:106)
    at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.getToken(InstanceMetadataServiceResourceFetcher.java:91)
    at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.readResource(InstanceMetadataServiceResourceFetcher.java:69)
    at com.amazonaws.internal.EC2ResourceFetcher.readResource(EC2ResourceFetcher.java:66)
    at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.getCredentialsEndpoint(InstanceMetadataServiceCredentialsFetcher.java:60)
    at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.getCredentialsResponse(InstanceMetadataServiceCredentialsFetcher.java:48)
    at com.amazonaws.auth.BaseCredentialsFetcher.fetchCredentials(BaseCredentialsFetcher.java:147)
    at com.amazonaws.auth.BaseCredentialsFetcher.getCredentials(BaseCredentialsFetcher.java:89)
    at com.amazonaws.auth.InstanceProfileCredentialsProvider.getCredentials(InstanceProfileCredentialsProvider.java:174)
    at com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper.getCredentials(EC2ContainerCredentialsProviderWrapper.java:75)
    at org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider.getCredentials(IAMInstanceCredentialsProvider.java:64)
    at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1269)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:845)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:794)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5520)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:6501)
    at com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:6473)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5505)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5467)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1480)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1416)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:807)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:543)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:524)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:347)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:804)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.doBucketProbing(S3AFileSystem.java:706)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:570)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:162)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3557)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3504)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:522)
    at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:260)
    at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:257)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:257)