Open jsoucheiron opened 1 year ago
there's definitely something funky with creds based on all the recent issues logged. we need a reliable test case where we can compare debug botocore and aiobotocore logs
I wish I could provide it but I haven't managed to reproduce this locally yet, just in production after running for a while.
I noticed a similar issue when reading/writing to S3 with a process count > 5 on version 2.4.2
any interesting info with debug level logging?
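To capture that kind of debug output, the credential machinery can be logged with the standard logging module; the logger names below follow botocore's module paths (an assumption worth double-checking against the installed version):

```python
import logging

# botocore resolves and refreshes credentials in botocore.credentials,
# which logs under that module name; aiobotocore reuses much of it.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.DEBUG)
logging.getLogger("aiobotocore").setLevel(logging.DEBUG)
```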
To add some additional context on this that might help untangle the issue:
- SQSService is long lived
- session = get_session() will be called multiple times
- get_session() calls might happen during token rotation

Would it be a better approach to have a long-lived session instantiated in the class instead of creating a new one every time send_message() is called?
long lived session/client always preferred. botocore should take care of refreshing credentials.
If that's the case we should probably document it, especially if it can cause bugs like this one.
can you try after release with https://github.com/aio-libs/aiobotocore/pull/1022 available?
could be related to https://github.com/aio-libs/aiobotocore/issues/1025, I'd try once that release is available (later today)
actually the important part isn't the session, it's the client, you should keep your client for as long as possible. A client is tied to a connection pool, so it's heavy to keep re-creating them
to debug this I really need a reproducible test case. I have my own AWS account so if you can create a full encapsulated test case I can try to debug this otherwise there just isn't enough for me to go on here and I'll have to close it. Another option is to create a test case using moto
The problem is that, given that the client is an async context manager, there's no nice/elegant way to have a long-lived client. You'd need to enter it manually and create some teardown hook to exit.
sure there is, we do this all the time:
import contextlib

from aiobotocore.session import get_session


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url
        self._session = get_session()
        self._exit_stack = contextlib.AsyncExitStack()

    async def __aenter__(self):
        # Enter the client context once; it stays open for the service's lifetime.
        self._client = await self._exit_stack.enter_async_context(
            self._session.create_client("sqs", region_name=self.sqs_region)
        )
        return self

    async def __aexit__(self, *args):
        await self._exit_stack.__aexit__(*args)
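The same AsyncExitStack pattern, demonstrated with a stand-in client so it runs without AWS access (FakeClient is purely illustrative, not an aiobotocore type):

```python
import asyncio
import contextlib

class FakeClient:
    """Stand-in for an aiobotocore client (illustrative only)."""
    def __init__(self):
        self.closed = False
    async def __aenter__(self):
        return self
    async def __aexit__(self, *args):
        self.closed = True

class Service:
    """Long-lived wrapper: the client is entered once and reused."""
    def __init__(self):
        self._exit_stack = contextlib.AsyncExitStack()
    async def __aenter__(self):
        self._client = await self._exit_stack.enter_async_context(FakeClient())
        return self
    async def __aexit__(self, *args):
        await self._exit_stack.__aexit__(*args)

async def demo():
    async with Service() as svc:
        client = svc._client
        assert not client.closed  # client stays open for the service's lifetime
    return client.closed  # closed exactly once, on service teardown

print(asyncio.run(demo()))  # → True
```

The point of the pattern is that the connection pool behind the client is created once and torn down once, instead of on every send_message() call.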
This is the kind of pattern I'd love to see documented. If there are certain ways of using the library that minimize load or are generally best practices given how it operates internally, we should make this explicit in the docs so people can adopt these patterns.
I think we assumed it was common knowledge but open to PRs / issues to add to docs
I'd like to be able to get to the bottom of what's causing this issue as well though. Unfortunately we'll need some sort of way to reproduce
Do we have a solution for this yet? I'm still experiencing it. I had thought the issue was not random and occurred on almost every call, but that could be because I only looked at later logs.
I wonder if explicit passing of access_key and secret_key would resolve this?
we need a way to repro or a detailed analysis from someone who can repro
Describe the bug
We have an aiohttp server that sends SQS messages as a result of certain actions. After running for a while we'll get
Our code that triggers the issue in production, where we use IAM roles:
We've tried multiple versions including 2.0.0 and 2.5.0
After many many tests trying to find a way to reproduce the issue locally, we've managed to mitigate it using backoff. When we do, this is what we get:
This leads me to believe there's a race condition somewhere that only triggers after running for a while, where you might temporarily end up with missing credentials.
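The backoff mitigation described can be sketched as a small async retry helper with exponential backoff and jitter; the helper name and parameters are hypothetical, and in real use you'd pass botocore's NoCredentialsError as the exception type:

```python
import asyncio
import random

async def with_backoff(func, *args, retries=4, base_delay=0.1, exc=Exception):
    """Retry an async call with exponential backoff + jitter.

    Papers over transiently missing credentials; it does not fix the
    underlying race, it just retries until the provider recovers.
    """
    for attempt in range(retries):
        try:
            return await func(*args)
        except exc:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))

# Demo with a stand-in that fails twice before succeeding:
class Flaky:
    def __init__(self):
        self.calls = 0
    async def send(self):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("NoCredentialsError stand-in")
        return "ok"

print(asyncio.run(with_backoff(Flaky().send)))  # → ok
```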
Checklist
- pip check passes without errors
- pip freeze results
Environment:
Additional context
Happy to provide any further context to help resolve this.