boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.
Apache License 2.0

Container credentials never refresh #3134

Closed egorksv closed 3 months ago

egorksv commented 3 months ago

Describe the bug

I am using boto3 as part of a FastAPI application deployed as an EKS pod with a Pod Identity role.

Relevant env vars:

AWS_CONTAINER_CREDENTIALS_FULL_URI=http://169.254.170.23/v1/credentials
AWS_DEFAULT_REGION=us-east-1
AWS_REGION=us-east-1
AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE=/var/run/secrets/pods.eks.amazonaws.com/serviceaccount/eks-pod-identity-token
AWS_STS_REGIONAL_ENDPOINTS=regional

The application uses the default boto3 Session. After some time it fails with MetadataRetrievalError and can no longer communicate with AWS APIs. Killing the pod fixes the problem.

Using awscli2 on the pod works, i.e. aws sts get-caller-identity returns the expected result.

Expected Behavior

Botocore should refresh container credentials automatically

Current Behavior

Application fails with "token expired" error:

Traceback (most recent call last):
  File "/app/.venv/lib/python3.11/site-packages/botocore/credentials.py", line 1964, in fetch_creds
    response = self._fetcher.retrieve_full_uri(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/botocore/utils.py", line 3072, in retrieve_full_uri
    return self._retrieve_credentials(full_url, headers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/botocore/utils.py", line 3116, in _retrieve_credentials
    return self._get_response(
           ^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/botocore/utils.py", line 3138, in _get_response
    raise MetadataRetrievalError(
botocore.exceptions.MetadataRetrievalError: Error retrieving metadata: Received non 200 response 400 from container metadata: [d9fb1c1c-86b6-43ec-b390-3372af5d9661]: (ExpiredTokenException): The token included in the request is expired: current date/time 2024-03-06T20:40:10.912988Z must be before the expiration date/time 2024-03-01T14:17:52Z., fault: client

Reproduction Steps

Create a long-lived Python application that uses the default boto3 session.

Call any AWS API every 5 minutes.

Deploy the app as an EKS pod with container authorization.
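
The steps above could be sketched roughly as follows. This is only an illustration: the helper names poll and main are hypothetical, the STS call stands in for whatever AWS API the application uses (the original app queried DynamoDB), and the 48-hour run length is an assumption chosen to outlast a 24-hour token.

```python
import time
from typing import Callable, List


def poll(call: Callable[[], object], interval_s: float, iterations: int,
         sleep: Callable[[float], None] = time.sleep) -> List[object]:
    """Invoke `call` `iterations` times, sleeping `interval_s` seconds between calls."""
    results = []
    for i in range(iterations):
        results.append(call())
        if i < iterations - 1:
            sleep(interval_s)
    return results


def main() -> None:
    # boto3 is imported lazily so that poll() can be exercised without it installed.
    import boto3

    # Default session: credentials are resolved by the container provider
    # configured through the AWS_CONTAINER_* environment variables.
    sts = boto3.client("sts")

    # 576 calls at 5-minute intervals is about 48 hours (assumed duration).
    poll(lambda: sts.get_caller_identity()["Arn"], interval_s=300, iterations=576)
```

Calling main() inside the pod and watching the logs should show whether calls start failing once the initial token expires.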

Possible Solution

No response

Additional Information/Context

No response

SDK version used

1.34.33

Environment details (OS name and version, etc.)

python:3.11-slim-buster docker container

egorksv commented 3 months ago

More details after looking further into the log files:

Pod start: 2024-02-29T18:27:06.547+04:00

First appearance of credentials refresh error: 2024-03-01T21:19:21.410+04:00

Log fragments when the error first appears:

Last successful AWS call querying DynamoDB table:

2024-03-01T21:17:18.365+04:00

Sync status: {'last_total_items': Decimal('677'), 'last_synced': '2024-03-01T17:15:18Z', 'batch_name': 'XXXXXXXX', 'last_page_size': Decimal('100'), 'last_total_pages': Decimal('7'), 'status': 'WAITNEWITEMS', 'last_page': Decimal('7')}

2024-03-01T21:19:21.410+04:00

Refreshing temporary credentials failed during advisory refresh period.
Traceback (most recent call last):
  File "/app/.venv/lib/python3.11/site-packages/botocore/credentials.py", line 1964, in fetch_creds
    response = self._fetcher.retrieve_full_uri(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/botocore/utils.py", line 3072, in retrieve_full_uri
    return self._retrieve_credentials(full_url, headers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/botocore/utils.py", line 3116, in _retrieve_credentials
    return self._get_response(
           ^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/botocore/utils.py", line 3138, in _get_response
    raise MetadataRetrievalError(
botocore.exceptions.MetadataRetrievalError: Error retrieving metadata: Received non 200 response 400 from container metadata: [5de44b40-d111-48f7-9845-d1a50e8d8577]: (ExpiredTokenException): The token included in the request is expired: current date/time 2024-03-01T17:19:20.399912Z must be before the expiration date/time 2024-03-01T14:26:31Z., fault: client

Events (date/times converted to GMT):

Pod start: 2024-02-29T14:27:06.547Z

Token issued, initial expiry date/time: 2024-03-01T14:26:31Z # 24-hour token

Last successful call to dynamodb: 2024-03-01T17:17:18.365Z # Session still up?!

First attempt to refresh credentials during "advisory refresh period": 2024-03-01T17:19:21.410Z # 3 hours AFTER token was supposed to have expired?!
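
The gaps between these events can be checked with a few lines of Python. This is just arithmetic on the timestamps listed above; the variable names and the parse_utc helper are made up for the illustration.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' as an aware UTC datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


pod_start = parse_utc("2024-02-29T14:27:06.547Z")
token_expiry = parse_utc("2024-03-01T14:26:31Z")
last_success = parse_utc("2024-03-01T17:17:18.365Z")
first_refresh_attempt = parse_utc("2024-03-01T17:19:21.410Z")

# Lifetime of the initial token, measured from pod start.
token_lifetime = token_expiry - pod_start

# How long after the stated expiry the session kept working,
# and how late the first logged refresh attempt was.
success_after_expiry = last_success - token_expiry
refresh_after_expiry = first_refresh_attempt - token_expiry

print("token lifetime:", token_lifetime)
print("last success after expiry:", success_after_expiry)
print("first refresh attempt after expiry:", refresh_after_expiry)
```

The token lifetime comes out just under 24 hours, and both the last successful call and the first logged refresh attempt land a little under 3 hours after the expiry time.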

I strongly suspect there is a programming error somewhere around the "advisory refresh period": it was probably supposed to start refreshing 3 hours BEFORE expiration, but instead started refreshing 3 hours AFTER.

This does not look like a time zone issue: I'm in GMT+4, so a three-hour offset kind of "does not compute".

nateprewitt commented 3 months ago

Hi @egorksv,

It looks like you might be encountering https://github.com/boto/botocore/pull/3114 which was fixed a few weeks ago in 1.34.41. Have you tried updating to a more recent version of Boto3/Botocore?
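
A quick way to tell whether an installed version predates that fix is a plain numeric comparison of the version strings. The sketch below assumes simple dotted numeric versions such as 1.34.33; the function names are hypothetical and not part of botocore.

```python
from typing import Tuple

FIXED_IN = "1.34.41"  # release mentioned above as containing the fix


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.34.33' into (1, 34, 33); assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def predates_fix(installed: str, fixed_in: str = FIXED_IN) -> bool:
    """Return True if `installed` is older than the release containing the fix."""
    return parse_version(installed) < parse_version(fixed_in)


print(predates_fix("1.34.33"))
print(predates_fix("1.34.56"))
```

In a running environment the installed version can be read from botocore.__version__ and passed to predates_fix.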

egorksv commented 3 months ago

Oh, thanks, I was on .33, will test on .56