aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Hitting S3 Request Rate Limits #1621

Closed liquidcarbon closed 2 years ago

liquidcarbon commented 2 years ago

Hi all, we have an issue that is rather difficult to reproduce and diagnose.

The question is: When executing awswrangler.s3.to_parquet(dataset=True), what HTTP requests are executed, and is there a way to bring them to light and suppress/throttle some of them?

Setup

To illustrate, a batch of 2 events should create the following objects (one file per partition):

s3://path/to/table1/event=1/uuid.snappy.parquet
s3://path/to/table2/event=1/uuid.snappy.parquet
...
s3://path/to/table7/event=1/uuid.snappy.parquet
s3://path/to/table8/event=1/uuid.snappy.parquet

s3://path/to/table1/event=2/uuid.snappy.parquet
s3://path/to/table2/event=2/uuid.snappy.parquet
...
s3://path/to/table7/event=2/uuid.snappy.parquet
s3://path/to/table8/event=2/uuid.snappy.parquet
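
For reference, each write is roughly a call like the following (the paths, table names, and columns are simplified placeholders, so treat this as a sketch rather than our exact production code):

import awswrangler as wr
import pandas as pd

# df holds the batch of 2 events; "event" is the partition column
df = pd.DataFrame({"event": [1, 2], "value": ["a", "b"]})

# One call like this is made per table (table1 .. table8) for the same batch
wr.s3.to_parquet(
    df=df,
    path="s3://path/to/table1/",
    dataset=True,
    partition_cols=["event"],  # yields .../event=1/... and .../event=2/...
    database="my_database",    # placeholder Glue database
    table="table1",
)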

Problem

About 10-40% of the executions fail with various errors that boil down to "you are trying to write too fast" (stack trace below), somehow exceeding the documented quota: "Amazon S3 supports a request rate of 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket."

I've tried to follow the stack trace to understand where exactly we're doing so many reads or writes - and I can't figure it out.

Workaround

We've had some success with tenacity.retry with exponential backoff, but it was a last resort. The whole thing still haunts me, and I want to understand why it happens.
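
To be concrete, the workaround is roughly the following (a sketch of the retry wrapper; the backoff parameters are illustrative, not the exact values we use):

from botocore.exceptions import ClientError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

import awswrangler as wr


@retry(
    retry=retry_if_exception_type(ClientError),          # retry on SlowDown and similar errors
    wait=wait_exponential(multiplier=1, min=2, max=60),  # exponential backoff between attempts
    stop=stop_after_attempt(5),
)
def write_to_athena(df, path, database, table):
    return wr.s3.to_parquet(
        df=df,
        path=path,
        dataset=True,
        partition_cols=["event"],
        database=database,
        table=table,
    )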

Stack trace

Traceback (most recent call last):
... in write_to_athena
response = awr.s3.to_parquet(
File "/usr/local/lib/python3.9/site-packages/awswrangler/_config.py", line 450, in wrapper
return function(**args)
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_write_parquet.py", line 637, in to_parquet
paths, partitions_values = _to_dataset(
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_write_dataset.py", line 228, in _to_dataset
paths, partitions_values = _to_partitions(
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_write_dataset.py", line 87, in _to_partitions
proxy.write(
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_write_concurrent.py", line 48, in write
self._results += func(boto3_session=boto3_session, **func_kwargs)
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_write_parquet.py", line 188, in _to_parquet
writer.write_table(table)
File "/usr/local/lib/python3.9/contextlib.py", line 126, in __exit__
next(self.gen)
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_write_parquet.py", line 75, in _new_writer
writer.close()
File "/usr/local/lib/python3.9/contextlib.py", line 126, in __exit__
next(self.gen)
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_fs.py", line 588, in open_s3_object
s3obj.close()
File "/usr/local/lib/python3.9/site-packages/awswrangler/s3/_fs.py", line 474, in close
_utils.try_it(
File "/usr/local/lib/python3.9/site-packages/awswrangler/_utils.py", line 343, in try_it
return f(**kwargs)
File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 395, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 725, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 5): Please reduce your request rate.
liquidcarbon commented 2 years ago

The above was observed with awswrangler 2.15

Reading the code today (2.16.1), it seems like awswrangler.s3._write_dataset._to_partitions should be called right here: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/s3/_write_dataset.py#L48 - it seems to be using the LakeFormation client, for which I do not have permissions (An error occurred (AccessDeniedException) when calling the GetTableObjects operation: User: arn:aws:iam:: ... is not authorized to perform: lakeformation:GetTableObjects because no identity-based policy allows the lakeformation:GetTableObjects action).

For the current and future versions of aws-sdk-pandas, do I need Lake Formation IAM permissions?

jaidisido commented 2 years ago

Based on the stack trace, the PUT call happens here. As you can see in the code, the call is already wrapped within a try_it method which handles exponential backoff and delays the calls, so the load submitted must really be high in your case. Perhaps turning on logging can help you measure the load further.
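
Something along these lines should bring the individual requests to light (standard Python logging; tune the logger levels to taste):

import logging

logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)              # library-level debug output
logging.getLogger("botocore").setLevel(logging.DEBUG)                 # shows the underlying HTTP calls
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)  # silence credential refresh noise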

The LakeFormation issue is strange because that was introduced a while back and there shouldn't be any difference between 2.15 and 2.16. You only enter the condition if working on a Governed Glue table, so I am not sure why you are encountering this unless your Glue table is indeed Governed...
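
A quick way to check is to look at the table type in the Glue catalog (a boto3 sketch; the database and table names are placeholders):

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_database", Name="table1")["Table"]

# Governed tables report TableType == "GOVERNED"; a plain table is typically "EXTERNAL_TABLE"
print(table.get("TableType"))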

kukushking commented 2 years ago

My 2 cents: try use_threads=False. At the moment each partition is written to S3 concurrently, and each thread handles its own retries. If you're already doing too much in one go, the retries would just repeat that load after roughly the same period of time.
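
Concretely, that just means passing the flag through to the write call (placeholder data; the path mirrors the layout from the issue):

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"event": [1, 2], "value": ["a", "b"]})  # placeholder data

wr.s3.to_parquet(
    df=df,
    path="s3://path/to/table1/",
    dataset=True,
    partition_cols=["event"],
    use_threads=False,  # write each partition sequentially instead of in parallel threads
)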

kukushking commented 2 years ago

Closing due to inactivity.