aio-libs / aiobotocore

asyncio support for botocore library using aiohttp
https://aiobotocore.aio-libs.org
Apache License 2.0

Apparently random NoCredentialsError after running for a while #1006

Open · jsoucheiron opened this issue 1 year ago

jsoucheiron commented 1 year ago

Describe the bug

We have an aiohttp server that sends SQS messages as a result of certain actions. After running for a while we'll get:

Traceback (most recent call last):
  File "/usr/local/skyscanner/app/services/sqs.py", line 21, in send_message
    await client.send_message(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 141, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 161, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 77, in _send_request
    request = await self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 70, in create_request
    await self._event_emitter.emit(event_name, request=request,
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/hooks.py", line 27, in _emit
    response = await handler(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 16, in handler
    return await self.sign(operation_name, request)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 63, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.10/site-packages/botocore/auth.py", line 378, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

Our code that triggers the issue in production, where we use IAM roles:

from typing import Any, Dict, Optional

from aiobotocore.session import get_session

# json_dumps_extended and logger are project-level helpers defined elsewhere.


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url

    async def send_message(self, pushed_data: Dict[str, Any], data_type: str, source: Optional[str]):
        try:
            # A new session and client are created on every call.
            session = get_session()
            async with session.create_client("sqs", region_name=self.sqs_region) as client:
                await client.send_message(
                    QueueUrl=self.sqs_url,
                    MessageBody=json_dumps_extended(
                        {"pushed_data": pushed_data, "data_type": data_type, "source": source or self.default_source}
                    ),
                )
        except Exception:
            logger.exception(f"Something went wrong in SQS upload of {pushed_data}")

We've tried multiple versions, including 2.0.0 and 2.5.0.

After many tests trying to find a way to reproduce the issue locally, we've managed to mitigate it using backoff. When we do, this is what we get:

[screenshot attached in the original issue]

This leads me to believe there's a race condition somewhere that only triggers after running for a while, where you can temporarily end up with missing credentials.
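
A minimal sketch of that mitigation, assuming the backoff package from the pip freeze below (the function name and retry parameters are illustrative, not our production code):

import backoff
from botocore.exceptions import NoCredentialsError


# Retry with exponential backoff when credential resolution
# transiently fails; backoff wraps coroutine functions natively.
@backoff.on_exception(backoff.expo, NoCredentialsError, max_tries=5)
async def send_message_with_retry(client, queue_url: str, body: str):
    await client.send_message(QueueUrl=queue_url, MessageBody=body)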

Checklist

pip freeze results

$ pip freeze
aiobotocore==2.0.0
aiocache==0.12.0
aiodns==3.0.0
aiohttp==3.8.1
aioitertools==0.11.0
aiosignal==1.3.1
aiotask-context==0.6.1
alembic==1.0.11
async-timeout==4.0.2
asyncpg==0.27.0
asyncpgsa==0.27.1
attrs==23.1.0
backoff==2.2.1
basictracer==3.2.0
black==23.3.0
boto3==1.19.8
botocore==1.22.8
Brotli==1.0.9
build==0.10.0
cachetools==5.3.0
cchardet==2.1.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
coverage==7.2.3
cryptography==3.4.8
Deprecated==1.2.13
exceptiongroup==1.1.1
flake8==6.0.0
frozenlist==1.3.3
googleapis-common-protos==1.59.0
grpcio==1.53.0
gunicorn==20.1.0
idna==3.4
importlib-metadata==6.4.1
iniconfig==2.0.0
isort==5.12.0
Jinja2==3.1.2
jmespath==0.10.0
jq==1.4.1
jsonpickle==3.0.1
lightstep==4.4.8
Mako==1.2.4
markdown-it-py==2.2.0
MarkupSafe==2.1.2
mccabe==0.7.0
mdurl==0.1.2
moto==4.1.7
multidict==6.0.4
mypy-extensions==1.0.0
object-pool==0.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-aiohttp-client==0.36b0
opentelemetry-instrumentation-logging==0.36b0
opentelemetry-opentracing-shim==0.36b0
opentelemetry-propagator-ot-trace==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
opentelemetry-util-http==0.36b0
opentracing==2.4.0
packaging==23.1
pathspec==0.11.1
pbr==5.11.1
pip-tools==6.13.0
platformdirs==3.2.0
pluggy==1.0.0
pprintpp==0.4.0
protobuf==3.20.3
psycopg2-binary==2.9.6
pycares==4.3.0
pycodestyle==2.10.0
pycparser==2.21
pydantic==1.10.7
pyflakes==3.0.1
pyformance==0.4
Pygments==2.15.0
pyproject_hooks==1.0.0
pytest==7.3.1
pytest-aiohttp==1.0.4
pytest-asyncio==0.21.0
pytest-clarity==1.0.1
pytest-cov==2.12.1
pytest-env==0.6.2
pytest-mock==1.12.1
python-dateutil==2.8.2
python-editor==1.0.4
PyYAML==6.0
requests==2.28.2
responses==0.23.1
rich==13.3.4
s3transfer==0.5.2
six==1.16.0
SQLAlchemy==1.3.24
statsd==3.3.0
thrift==0.16.0
toml==0.10.2
tomli==2.0.1
types-PyYAML==6.0.12.9
types-requests==2.28.11.17
types-urllib3==1.26.25.10
typing_extensions==4.5.0
urllib3==1.26.15
uvloop==0.17.0
Werkzeug==2.2.3
wrapt==1.15.0
xmltodict==0.13.0
yarl==1.8.2
zipp==3.15.0


Additional context: Happy to provide any further details to help resolve this.

thehesiod commented 1 year ago

there's definitely something funky with creds based on all the recent issues logged. we need a reliable test case where we can compare debug botocore and aiobotocore logs

jsoucheiron commented 1 year ago

I wish I could provide it but I haven't managed to reproduce this locally yet, just in production after running for a while.

hbikki commented 1 year ago

I noticed a similar issue happening when reading/writing to S3 with a process count > 5 on version 2.4.2.

thehesiod commented 1 year ago

any interesting info with debug level logging?
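
For reference, a minimal sketch of enabling that logging with the standard logging module (the logger names are the ones both libraries use):

import logging

# Emit debug-level output from both libraries, including botocore's
# credential-provider resolution messages.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("aiobotocore").setLevel(logging.DEBUG)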

jsoucheiron commented 1 year ago

To add some additional context that might help untangle the issue:

Would it be a better approach to have a long-lived session instantiated in the class instead of creating a new one every time send_message() is called?

thehesiod commented 1 year ago

long lived session/client always preferred. botocore should take care of refreshing credentials.

jsoucheiron commented 1 year ago

If that's the case we should probably document it, especially if it can cause bugs like this one.


thehesiod commented 1 year ago

can you try once a release including https://github.com/aio-libs/aiobotocore/pull/1022 is available?

thehesiod commented 1 year ago

could be related to https://github.com/aio-libs/aiobotocore/issues/1025, I'd try once that release is available (later today)

thehesiod commented 10 months ago

actually the important part isn't the session, it's the client: you should keep your client alive for as long as possible. A client is tied to a connection pool, so it's heavy to keep re-creating one

thehesiod commented 10 months ago

to debug this I really need a reproducible test case. I have my own AWS account, so if you can create a fully encapsulated test case I can try to debug this; otherwise there just isn't enough for me to go on here and I'll have to close it. Another option is to create a test case using moto (see the sketch below)
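
A skeleton of such an encapsulated test case, assuming moto's ThreadedMotoServer (moto 4.1.7 is already in the pip freeze above); note that the static test keys used here bypass the IAM-role refresh path, so this is a starting point rather than a guaranteed repro:

import asyncio

from aiobotocore.session import get_session
from moto.server import ThreadedMotoServer


async def main():
    # moto's in-process HTTP server; no real AWS account is touched.
    server = ThreadedMotoServer(port=5000)
    server.start()
    try:
        session = get_session()
        async with session.create_client(
            "sqs",
            region_name="us-east-1",
            endpoint_url="http://127.0.0.1:5000",
            aws_access_key_id="testing",
            aws_secret_access_key="testing",
        ) as client:
            queue = await client.create_queue(QueueName="repro")
            # Hammer the API to try to surface timing-dependent failures.
            for _ in range(1000):
                await client.send_message(
                    QueueUrl=queue["QueueUrl"], MessageBody="ping"
                )
    finally:
        server.stop()


asyncio.run(main())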

jsoucheiron commented 10 months ago

The problem is that, given that the client is an async context manager, there's no nice/elegant way to have a long-lived client. You'd need to enter it manually and create some teardown hook to exit.

thehesiod commented 10 months ago

sure there is, we do this all the time:

import contextlib

from aiobotocore.session import get_session


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url
        self._session = get_session()
        self._exit_stack = contextlib.AsyncExitStack()

    async def __aenter__(self):
        # Enter the client's context manager once; the exit stack owns
        # the teardown, so the client lives as long as the service.
        self._client = await self._exit_stack.enter_async_context(
            self._session.create_client("sqs", region_name=self.sqs_region)
        )
        return self

    async def __aexit__(self, *args):
        await self._exit_stack.__aexit__(*args)
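
For an aiohttp server like the one described above, one way to keep the service alive for the whole application lifetime is aiohttp's cleanup_ctx (the region and queue URL below are placeholders):

from aiohttp import web


async def sqs_service_ctx(app: web.Application):
    # Entered once at startup, exited once at shutdown.
    async with SQSService("eu-west-1", "https://sqs.example/queue") as sqs:
        app["sqs"] = sqs
        yield


app = web.Application()
app.cleanup_ctx.append(sqs_service_ctx)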

jsoucheiron commented 10 months ago

This is the kind of pattern I'd love to see documented. If there are certain ways of using the library that minimize load or are generally best practices given how it internally operates, we should make this explicit in the docs so people can adopt these patterns.

thehesiod commented 10 months ago

I think we assumed it was common knowledge but open to PRs / issues to add to docs

thehesiod commented 10 months ago

I'd like to be able to get to the bottom of what's causing this issue as well though. Unfortunately we'll need some sort of way to reproduce it.

nishantkumar1292 commented 1 month ago

Do we have a solution for this yet? I'm still experiencing this. However, I thought the issue was not random and occurred with almost every call, though that could be because I only looked at later logs.

I wonder if explicitly passing access_key and secret_key would resolve this?
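
For what it's worth, a sketch of what that would look like (key values are placeholders); static keys are resolved once and never refreshed, unlike the IAM-role provider, so they would sidestep any refresh-time race:

from aiobotocore.session import get_session


async def send_with_static_creds(queue_url: str, body: str):
    session = get_session()
    async with session.create_client(
        "sqs",
        region_name="us-east-1",
        aws_access_key_id="AKIA...",        # placeholder
        aws_secret_access_key="wJal...",    # placeholder
    ) as client:
        await client.send_message(QueueUrl=queue_url, MessageBody=body)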

thehesiod commented 1 month ago

we need a way to repro or a detailed analysis from someone who can repro