aio-libs / aiobotocore

asyncio support for botocore library using aiohttp
https://aiobotocore.aio-libs.org
Apache License 2.0
1.16k stars 182 forks source link

Throttled responses with status 429 crash aiobotocore #1128

Closed sfc-gh-afedorov closed 1 month ago

sfc-gh-afedorov commented 1 month ago

Describe the bug On high throughput use or a busy account, AWS responds with 429, which botocore Retry documentation says is a condition for a retry, but aiobotocore crashes with

./../../venv/lib/python3.9/site-packages/aiobotocore/paginate.py:30 in __anext__
response = await self._make_request(current_kwargs)
./../../venv/lib/python3.9/site-packages/aiobotocore/client.py:391 in _make_api_call
http, parsed_response = await self._make_request(
./../../venv/lib/python3.9/site-packages/aiobotocore/client.py:419 in _make_request
return await self._endpoint.make_request(
./../../venv/lib/python3.9/site-packages/aiobotocore/endpoint.py:99 in _send_request
success_response, exception = await self._get_response(
./../../venv/lib/python3.9/site-packages/aiobotocore/endpoint.py:141 in _get_response
success_response, exception = await self._do_get_response(
./../../venv/lib/python3.9/site-packages/aiobotocore/endpoint.py:217 in _do_get_response
parsed_response = parser.parse(
./../../venv/lib/python3.9/site-packages/botocore/parsers.py:250 in parse
parsed = self._do_error_parse(response, shape)
./../../venv/lib/python3.9/site-packages/botocore/parsers.py:628 in _do_error_parse
original = super()._do_error_parse(response, shape)
./../../venv/lib/python3.9/site-packages/botocore/parsers.py:559 in _do_error_parse
root = self._parse_xml_string_to_dom(xml_contents)
./../../venv/lib/python3.9/site-packages/botocore/parsers.py:508 in _parse_xml_string_to_dom
raise ResponseParserError(
botocore.parsers.ResponseParserError: Unable to parse response (no element found: line 1, column 0), invalid XML received. Further retries may succeed:
b''

Checklist

pip freeze results

Environment:

jakob-keller commented 1 month ago

Thank you for reporting the crash!

I briefly went through the stacktrace and believe the issue is not directly related to aiobotocore, but rather lies with response parsing done in botocore.

The stacktrace suggests that the crash occurs when working with pagination, which could be relevant.

Could you please provide more context? Which API call and client configuration is triggering the exception? It could well be an edge case that needs to be raised with botocore, but more information would be useful to tell for sure.

sfc-gh-afedorov commented 1 month ago

Thanks for the prompt reply! I see it happening with both paginated and unpaginated API calls, e.g. iam's generate_credential_report and sts's assume_role (I'm creating sessions to ingest inventory from a lot of accounts).

I think when the response status is 429, the response should not be parsed, as it's an empty string.

Happy to expound more but want to be specific as to what might be most helpful. The use-case I'm working on is a bit unique in snowflakedb/SnowAlert/pull/676 because there are a lot of concurrent sessions, but I think I can probably reproduce it more simply or just write a local unit test if that's more helpful.

jakob-keller commented 1 month ago

Thank you for following up and providing helpful context.

I still feel that this should be fixed in botocore. There is an open issue that could be similar, but has not been fixed AFAIK: https://github.com/boto/botocore/issues/1732

Is there a way to replicate the exception with botocore? If botocore itself is also affected, we should raise the issue there. Otherwise, I am obviously missing something.

sfc-gh-afedorov commented 1 month ago

That ticket says "There's also a too_many_requests that only applies for 429 error codes" which is what's also in the Retry docs I linked. They're asking that botocore also handle status=400 errors specifically for requests to cognito-idp endpoints.

The _retry.json they mention already has the 429 error defined as "too_many_requests" here which looks like it uses random delay with an growth_factor of 2 for 5 retries.

I'm not sure if it can be fixed in botocore since it actually needs to retry on a 429 response, which only aiobotocore can do because it needs to use aio?

jakob-keller commented 1 month ago

Did you customise your retry config? By default, legacy mode is used which does not appear to handle 429 (TooManyRequestsException). You might need to use the standard mode in your scenario.

sfc-gh-afedorov commented 1 month ago

Thanks for suggesting that — I had tried it briefly before but also just now with retries={'mode': 'standard'} and also adaptive, and I'm still getting the same error on sts assume-role:

2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/connectors/utils.py:157 in aio_sts_assume_role
2024-08-02T22:54:11.205Z src_role = await sts.assume_role(
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/aiobotocore/client.py:391 in _make_api_call
2024-08-02T22:54:11.205Z http, parsed_response = await self._make_request(
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/aiobotocore/client.py:419 in _make_request
2024-08-02T22:54:11.205Z return await self._endpoint.make_request(
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/aiobotocore/endpoint.py:99 in _send_request
2024-08-02T22:54:11.205Z success_response, exception = await self._get_response(
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/aiobotocore/endpoint.py:141 in _get_response
2024-08-02T22:54:11.205Z success_response, exception = await self._do_get_response(
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/aiobotocore/endpoint.py:217 in _do_get_response
2024-08-02T22:54:11.205Z parsed_response = parser.parse(
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/botocore/parsers.py:250 in parse
2024-08-02T22:54:11.205Z parsed = self._do_error_parse(response, shape)
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/botocore/parsers.py:559 in _do_error_parse
2024-08-02T22:54:11.205Z root = self._parse_xml_string_to_dom(xml_contents)
2024-08-02T22:54:11.205Z ./../../venv/lib/python3.9/site-packages/botocore/parsers.py:508 in _parse_xml_string_to_dom
2024-08-02T22:54:11.205Z raise ResponseParserError(
2024-08-02T22:54:11.205Z botocore.parsers.ResponseParserError: Unable to parse response (no element found: line 1, column 0), invalid XML received. Further retries may succeed:
2024-08-02T22:54:11.205Z b''

I need to understand this functionality a little better still because I'm not sure how aiobotocore can delegate retries to a dependency that implements synchronous calls. Do you know if there is some kind of inversion of control mechanism where botocore lets aiobotocore handle the aio requests while it handles the retry logic?

In the meantime, manually setting up retries where I see these errors should work for my use-case.

thehesiod commented 1 month ago

let's figure this out. is it possible to put a breakpoint and debug? if you can capture the response it would help. or modify code to add a conditional breakpoint statement. I can probably look deeper into this in a couple of days

sfc-gh-afedorov commented 1 month ago

Yes, that's how I knew the response status was 429. Let me know if you'd like me to gather any other details.

thehesiod commented 1 month ago

if it happens in aiobotocore you can definitely try reproing it with botocore, they're there same just one is async. most likely like stated above is actually a botocore bug. we can try porting the relevant unittests to ensure this is the case

thehesiod commented 1 month ago

check this out: https://github.com/boto/botocore/issues/1830

thehesiod commented 1 month ago

given someone reproduced this in botocore going to close as a botocore issue. Please report there with botocore example and let us know when this is fixed!