Open steveloughran opened 5 months ago
Hi @steveloughran
we added a task to analyze and improve the credential experience. Thank you for the detailed bug report - HADOOP-19181.
Is there an approximate timeline for this? I don't want to have to reimplement our own refresh thread code, not because it is hard, but because maintenance and testing with fault injection are the pain points.
any timeline updates?
I am about to have to implement a workaround for what should be a foundational use case: "reliably provide credentials to Java applications running in EC2 instances". The fact that this works for k8s containers but not VMs implies that the test setup isn't stressing this, and it has become the task of downstream projects to (a) discover and (b) come up with workarounds.
New stack trace. Single host launching 5 processes, triggering a 503 response. This is not during cache refresh, this is simply startup.
The good news: this is easier for you to replicate in a test before you fix.
Comment from the engineer:
Another data point is that while the failure is clearly seen in the error log,
the linklocal_allowance_exceeded metric remained 0 (zero) through the whole build,
so either the metric is broken or, if it tells the truth, the failure is not in the local link layer
but in the service behind the link.
24/10/01 17:48:04 WARN internal.InstanceMetadataServiceResourceFetcher: Fail to retrieve token
com.amazonaws.AmazonServiceException: Service Unavailable (Service: null; Status Code: 503; Error Code: null; Request ID: null; Proxy: null)
at com.amazonaws.internal.EC2ResourceFetcher.handleErrorResponse(EC2ResourceFetcher.java:161)
at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:106)
at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.getToken(InstanceMetadataServiceResourceFetcher.java:91)
at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.readResource(InstanceMetadataServiceResourceFetcher.java:69)
at com.amazonaws.internal.EC2ResourceFetcher.readResource(EC2ResourceFetcher.java:66)
at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.getCredentialsEndpoint(InstanceMetadataServiceCredentialsFetcher.java:60)
at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.getCredentialsResponse(InstanceMetadataServiceCredentialsFetcher.java:48)
at com.amazonaws.auth.BaseCredentialsFetcher.fetchCredentials(BaseCredentialsFetcher.java:147)
at com.amazonaws.auth.BaseCredentialsFetcher.getCredentials(BaseCredentialsFetcher.java:89)
at com.amazonaws.auth.InstanceProfileCredentialsProvider.getCredentials(InstanceProfileCredentialsProvider.java:174)
at com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper.getCredentials(EC2ContainerCredentialsProviderWrapper.java:75)
at org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider.getCredentials(IAMInstanceCredentialsProvider.java:64)
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:177)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1269)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:845)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:794)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5520)
at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:6501)
at com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:6473)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5505)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5467)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1480)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1416)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:807)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:543)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:524)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:347)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:804)
at org.apache.hadoop.fs.s3a.S3AFileSystem.doBucketProbing(S3AFileSystem.java:706)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:570)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:162)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3557)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3504)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:522)
at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:260)
at org.apache.hadoop.fs.FileSystem$1.run(FileSystem.java:257)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:257)
Describe the bug
The full details and analysis are in HADOOP-19181. IAMCredentialsProvider throttle failures
ComparableUtils.minimum(Duration.ofMillis(exponentialBackoffMillis), Duration.ofSeconds(10)). That is: up to 9 seconds after the credentials expire.
It's notable that ContainerCredentialsProvider has more resilience in
Expected Behavior
IAMCredentialsProvider to always issue valid credentials. This should include asynchronous refreshing of credentials when enabled, far enough in advance of their expiry that recovery attempts can be repeated.
Current Behavior
Invalid credentials were returned when requesting credentials to sign a request.
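To make the expected behaviour concrete, here is a minimal JDK-only sketch of a refresh-ahead policy. `RefreshAheadPolicy` and its method names are hypothetical and not taken from the AWS SDK, and the 15-minute window used in the note below is an assumption. It only shows the arithmetic of starting the refresh early enough that several failed attempts still fit before expiry.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper (not an AWS SDK class): decides when to refresh cached credentials. */
public final class RefreshAheadPolicy {
    private final Duration prefetchWindow;

    public RefreshAheadPolicy(Duration prefetchWindow) {
        this.prefetchWindow = prefetchWindow;
    }

    /** Time at which a background refresh should first be attempted. */
    public Instant refreshAt(Instant expiry) {
        return expiry.minus(prefetchWindow);
    }

    /** True once "now" has reached the prefetch window (or is past expiry). */
    public boolean shouldRefresh(Instant now, Instant expiry) {
        return !now.isBefore(refreshAt(expiry));
    }

    /** How many attempts, each costing at most maxAttemptCost, still fit before expiry. */
    public long attemptsBeforeExpiry(Instant now, Instant expiry, Duration maxAttemptCost) {
        long costMillis = maxAttemptCost.toMillis();
        if (costMillis <= 0) {
            throw new IllegalArgumentException("maxAttemptCost must be at least 1 ms");
        }
        Duration remaining = Duration.between(now, expiry);
        if (remaining.isNegative() || remaining.isZero()) {
            return 0; // already expired: no attempt can finish in time
        }
        return remaining.toMillis() / costMillis;
    }
}
```

With a 15-minute window and attempts costing at most 10 seconds each, `attemptsBeforeExpiry` reports 90 attempts between the start of the window and expiry, compared with none once the credentials have already expired.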
Reproduction Steps
See the Hadoop bug report (HADOOP-19181). The deployment was a single EC2 node running many services, long enough for the EC2 credentials to expire and need refreshing.
Possible Solution
I really don't see what we can do in the Hadoop codebase to recover from this. We could consider extending our own IAMInstanceCredentialsProvider to wrap the retry failures with our own sleep/retry, but as it'd take > 10 seconds, API calls awaiting signing (e.g. S3 Express CreateSession) will time out.
I'd propose copying ContainerCredentialsProvider, if not with the retries then at least the expiry time many minutes ahead of actual credential expiry.
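A rough sketch of the sleep/retry wrapper considered above, using only the JDK. `CappedBackoffRetry`, `delayFor`, `retry` and `worstCaseWait` are hypothetical names, the `Supplier` stands in for a call to the wrapped credential provider, and the doubling schedule is an assumption. Only the 10 second cap mirrors the `ComparableUtils.minimum(...)` expression quoted in this report.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical wrapper (not Hadoop or AWS SDK code): retry with capped exponential backoff. */
public final class CappedBackoffRetry {
    private CappedBackoffRetry() { }

    /** Delay after failed attempt number `attempt` (1-based): min(base * 2^(attempt-1), cap). */
    public static Duration delayFor(int attempt, Duration base, Duration cap) {
        long capMs = cap.toMillis();
        long ms = Math.min(base.toMillis(), capMs);
        for (int i = 1; i < attempt && ms < capMs; i++) {
            // Double the delay, clamping to the cap without overflowing.
            ms = (ms > capMs / 2) ? capMs : ms * 2;
        }
        return Duration.ofMillis(ms);
    }

    /** Total sleep time if every one of maxAttempts attempts fails. */
    public static Duration worstCaseWait(int maxAttempts, Duration base, Duration cap) {
        Duration total = Duration.ZERO;
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            total = total.plus(delayFor(attempt, base, cap));
        }
        return total;
    }

    /** Calls fetch up to maxAttempts times, sleeping between failures; rethrows the last failure. */
    public static <T> T retry(Supplier<T> fetch, int maxAttempts, Duration base, Duration cap) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayFor(attempt, base, cap).toMillis());
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while backing off", ie);
                    }
                }
            }
        }
        throw last;
    }
}
```

With a 100 ms base and a 10 s cap, five attempts sleep for at most 1.5 s in total, and the delay reaches the 10 s cap after the eighth failed attempt. `worstCaseWait` is there so a caller can check the retry budget against its own signing timeout before choosing `maxAttempts`.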
Additional Information/Context
No response
AWS Java SDK version used
2.24.6
JDK version used
an openjdk java 8 build
Operating System and version
linux