aio-libs / aiobotocore

asyncio support for botocore library using aiohttp
https://aiobotocore.aio-libs.org
Apache License 2.0

Apparently random NoCredentialsError after running for a while #1006

Open · jsoucheiron opened this issue 1 year ago

jsoucheiron commented 1 year ago

Describe the bug

We have an aiohttp server that sends SQS messages as a result of certain actions. After running for a while we'll get:

Traceback (most recent call last):
  File "/usr/local/skyscanner/app/services/sqs.py", line 21, in send_message
    await client.send_message(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 141, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 161, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 77, in _send_request
    request = await self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 70, in create_request
    await self._event_emitter.emit(event_name, request=request,
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/hooks.py", line 27, in _emit
    response = await handler(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 16, in handler
    return await self.sign(operation_name, request)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 63, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.10/site-packages/botocore/auth.py", line 378, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

Our code that triggers the issue in production, where we use IAM roles:

from typing import Any, Dict, Optional

from aiobotocore.session import get_session

# json_dumps_extended and logger are project-level helpers defined elsewhere.


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url

    async def send_message(self, pushed_data: Dict[str, Any], data_type: str, source: Optional[str]):
        try:
            # A new session and client are created on every call.
            session = get_session()
            async with session.create_client("sqs", region_name=self.sqs_region) as client:
                await client.send_message(
                    QueueUrl=self.sqs_url,
                    MessageBody=json_dumps_extended(
                        {"pushed_data": pushed_data, "data_type": data_type, "source": source or self.default_source}
                    ),
                )
        except Exception:
            logger.exception(f"Something went wrong in SQS upload of {pushed_data}")

We've tried multiple versions, including 2.0.0 and 2.5.0.

After many tests trying to find a way to reproduce the issue locally, we've managed to mitigate it using backoff. When we do, this is what we get:

[screenshot attached in the original issue]

This leads me to believe there's a race condition somewhere that only triggers after running for a while, where you can temporarily end up with missing credentials.
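
A minimal sketch of that mitigation, assuming the backoff package from the pip freeze below (the function name and retry parameters are illustrative, not our production code):

import backoff
from botocore.exceptions import NoCredentialsError


# Retry with exponential backoff when credential resolution
# transiently fails; backoff wraps coroutine functions natively.
@backoff.on_exception(backoff.expo, NoCredentialsError, max_tries=5)
async def send_message_with_retry(client, queue_url: str, body: str):
    await client.send_message(QueueUrl=queue_url, MessageBody=body)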

Checklist

pip freeze results

$ pip freeze
aiobotocore==2.0.0
aiocache==0.12.0
aiodns==3.0.0
aiohttp==3.8.1
aioitertools==0.11.0
aiosignal==1.3.1
aiotask-context==0.6.1
alembic==1.0.11
async-timeout==4.0.2
asyncpg==0.27.0
asyncpgsa==0.27.1
attrs==23.1.0
backoff==2.2.1
basictracer==3.2.0
black==23.3.0
boto3==1.19.8
botocore==1.22.8
Brotli==1.0.9
build==0.10.0
cachetools==5.3.0
cchardet==2.1.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
coverage==7.2.3
cryptography==3.4.8
Deprecated==1.2.13
exceptiongroup==1.1.1
flake8==6.0.0
frozenlist==1.3.3
googleapis-common-protos==1.59.0
grpcio==1.53.0
gunicorn==20.1.0
idna==3.4
importlib-metadata==6.4.1
iniconfig==2.0.0
isort==5.12.0
Jinja2==3.1.2
jmespath==0.10.0
jq==1.4.1
jsonpickle==3.0.1
lightstep==4.4.8
Mako==1.2.4
markdown-it-py==2.2.0
MarkupSafe==2.1.2
mccabe==0.7.0
mdurl==0.1.2
moto==4.1.7
multidict==6.0.4
mypy-extensions==1.0.0
object-pool==0.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-aiohttp-client==0.36b0
opentelemetry-instrumentation-logging==0.36b0
opentelemetry-opentracing-shim==0.36b0
opentelemetry-propagator-ot-trace==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
opentelemetry-util-http==0.36b0
opentracing==2.4.0
packaging==23.1
pathspec==0.11.1
pbr==5.11.1
pip-tools==6.13.0
platformdirs==3.2.0
pluggy==1.0.0
pprintpp==0.4.0
protobuf==3.20.3
psycopg2-binary==2.9.6
pycares==4.3.0
pycodestyle==2.10.0
pycparser==2.21
pydantic==1.10.7
pyflakes==3.0.1
pyformance==0.4
Pygments==2.15.0
pyproject_hooks==1.0.0
pytest==7.3.1
pytest-aiohttp==1.0.4
pytest-asyncio==0.21.0
pytest-clarity==1.0.1
pytest-cov==2.12.1
pytest-env==0.6.2
pytest-mock==1.12.1
python-dateutil==2.8.2
python-editor==1.0.4
PyYAML==6.0
requests==2.28.2
responses==0.23.1
rich==13.3.4
s3transfer==0.5.2
six==1.16.0
SQLAlchemy==1.3.24
statsd==3.3.0
thrift==0.16.0
toml==0.10.2
tomli==2.0.1
types-PyYAML==6.0.12.9
types-requests==2.28.11.17
types-urllib3==1.26.25.10
typing_extensions==4.5.0
urllib3==1.26.15
uvloop==0.17.0
Werkzeug==2.2.3
wrapt==1.15.0
xmltodict==0.13.0
yarl==1.8.2
zipp==3.15.0


Additional context: Happy to provide any further details to help resolve this.

thehesiod commented 1 year ago

there's definitely something funky with creds based on all the recent issues logged. we need a reliable test case where we can compare debug botocore and aiobotocore logs

jsoucheiron commented 1 year ago

I wish I could provide it but I haven't managed to reproduce this locally yet, just in production after running for a while.

hbikki commented 1 year ago

I noticed a similar issue happening when reading/writing to S3 with a process count > 5 on version 2.4.2.

thehesiod commented 1 year ago

any interesting info with debug level logging?
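
For reference, a minimal sketch of enabling that logging with the standard logging module (the logger names are the ones both libraries use):

import logging

# Emit debug-level output from both libraries, including botocore's
# credential-provider resolution messages.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("aiobotocore").setLevel(logging.DEBUG)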

jsoucheiron commented 1 year ago

To add some additional context that might help untangle the issue:

Would it be a better approach to have a long-lived session instantiated in the class instead of creating a new one every time send_message() is called?

thehesiod commented 1 year ago

long lived session/client always preferred. botocore should take care of refreshing credentials.

jsoucheiron commented 1 year ago

If that's the case we should probably document it, especially if it can cause bugs like this one.


thehesiod commented 1 year ago

can you try once a release including https://github.com/aio-libs/aiobotocore/pull/1022 is available?

thehesiod commented 1 year ago

could be related to https://github.com/aio-libs/aiobotocore/issues/1025, I'd try once that release is available (later today)

thehesiod commented 10 months ago

actually the important part isn't the session, it's the client: you should keep your client alive for as long as possible. A client is tied to a connection pool, so it's heavy to keep re-creating one

thehesiod commented 10 months ago

to debug this I really need a reproducible test case. I have my own AWS account, so if you can create a fully encapsulated test case I can try to debug this; otherwise there just isn't enough for me to go on here and I'll have to close it. Another option is to create a test case using moto (see the sketch below)
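
A skeleton of such an encapsulated test case, assuming moto's ThreadedMotoServer (moto 4.1.7 is already in the pip freeze above); note that the static test keys used here bypass the IAM-role refresh path, so this is a starting point rather than a guaranteed repro:

import asyncio

from aiobotocore.session import get_session
from moto.server import ThreadedMotoServer


async def main():
    # moto's in-process HTTP server; no real AWS account is touched.
    server = ThreadedMotoServer(port=5000)
    server.start()
    try:
        session = get_session()
        async with session.create_client(
            "sqs",
            region_name="us-east-1",
            endpoint_url="http://127.0.0.1:5000",
            aws_access_key_id="testing",
            aws_secret_access_key="testing",
        ) as client:
            queue = await client.create_queue(QueueName="repro")
            # Hammer the API to try to surface timing-dependent failures.
            for _ in range(1000):
                await client.send_message(
                    QueueUrl=queue["QueueUrl"], MessageBody="ping"
                )
    finally:
        server.stop()


asyncio.run(main())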

jsoucheiron commented 10 months ago

The problem is that, given that the client is an async context manager, there's no nice/elegant way to have a long-lived client. You'd need to enter it manually and create some teardown hook to exit.

thehesiod commented 10 months ago

sure there is, we do this all the time:

import contextlib

from aiobotocore.session import get_session


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url
        self._session = get_session()
        self._exit_stack = contextlib.AsyncExitStack()

    async def __aenter__(self):
        # Enter the client's context manager once; the exit stack owns
        # the teardown, so the client lives as long as the service.
        self._client = await self._exit_stack.enter_async_context(
            self._session.create_client("sqs", region_name=self.sqs_region)
        )
        return self

    async def __aexit__(self, *args):
        await self._exit_stack.__aexit__(*args)
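
For an aiohttp server like the one described above, one way to keep the service alive for the whole application lifetime is aiohttp's cleanup_ctx (the region and queue URL below are placeholders):

from aiohttp import web


async def sqs_service_ctx(app: web.Application):
    # Entered once at startup, exited once at shutdown.
    async with SQSService("eu-west-1", "https://sqs.example/queue") as sqs:
        app["sqs"] = sqs
        yield


app = web.Application()
app.cleanup_ctx.append(sqs_service_ctx)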

jsoucheiron commented 10 months ago

This is the kind of pattern I'd love to see documented. If there are certain ways of using the library that minimize load or are generally best practices given how it internally operates, we should make this explicit in the docs so people can adopt these patterns.

thehesiod commented 10 months ago

I think we assumed it was common knowledge but open to PRs / issues to add to docs

thehesiod commented 10 months ago

I'd like to be able to get to the bottom of what's causing this issue as well though. Unfortunately we'll need some sort of way to reproduce it.

nishantkumar1292 commented 1 month ago

Do we have a solution for this yet? I'm still experiencing this. However, I thought the issue was not random and occurred with almost every call, though that could be because I only looked at later logs.

I wonder if explicitly passing access_key and secret_key would resolve this?
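
For what it's worth, a sketch of what that would look like (key values are placeholders); static keys are resolved once and never refreshed, unlike the IAM-role provider, so they would sidestep any refresh-time race:

from aiobotocore.session import get_session


async def send_with_static_creds(queue_url: str, body: str):
    session = get_session()
    async with session.create_client(
        "sqs",
        region_name="us-east-1",
        aws_access_key_id="AKIA...",        # placeholder
        aws_secret_access_key="wJal...",    # placeholder
    ) as client:
        await client.send_message(QueueUrl=queue_url, MessageBody=body)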

thehesiod commented 1 month ago

we need a way to repro or a detailed analysis from someone who can repro