The above was observed with awswrangler 2.15.
Reading the code today (2.16.1), it seems like awswrangler.s3._write_dataset._to_partitions should be called, right here: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/s3/_write_dataset.py#L48 - it seems to be using the Lake Formation client, for which I do not have permissions (An error occurred (AccessDeniedException) when calling the GetTableObjects operation: User: arn:aws:iam:: ... is not authorized to perform: lakeformation:GetTableObjects because no identity-based policy allows the lakeformation:GetTableObjects action).
For the current and future versions of aws-sdk-pandas, do I need Lake Formation permissions in my IAM policies?
Based on the stack trace, the PUT call happens here. As you can see in the code, the call is already wrapped in a try_it method that handles exponential backoff and delays the calls, so the load being submitted must really be high in your case. Perhaps turning on logging can help you measure the load further.
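For reference, a minimal sketch of turning that logging on (the "awswrangler" and "botocore" logger names are the libraries' standard loggers; the level and format are just examples):

```python
import logging

# Surface aws-sdk-pandas and botocore debug output, including botocore's
# retry/backoff messages, to see how many requests hit the same prefix.
logging.basicConfig(level=logging.INFO, format="%(name)s: %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
```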
The LakeFormation issue is strange because that was introduced a while back and there shouldn't be any difference between 2.15 and 2.16. You only enter the condition if working on a Governed Glue table, so I am not sure why you are encountering this unless your Glue table is indeed Governed...
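If you want to double-check whether the table is actually Governed, one way (a sketch; the Glue get_table call is the standard boto3 API, and the database/table names are placeholders) is to look at the table type Glue reports:

```python
import boto3

# Only a Governed Glue table should send aws-sdk-pandas down the
# Lake Formation code path. Database/table names below are placeholders.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
print(table.get("TableType"))  # "GOVERNED" vs. e.g. "EXTERNAL_TABLE"
```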
My 2 cents: try use_threads=False. At the moment each partition is written to S3 concurrently and each thread handles its own retries, so if you're already doing too much in one go, the retries just repeat that burst after roughly the same period of time.
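Roughly like this (a sketch; the DataFrame, bucket/prefix, and partition column are placeholders):

```python
import pandas as pd
import awswrangler as wr

df = pd.DataFrame({"date": ["2022-01-01", "2022-01-02"], "value": [1, 2]})

# use_threads=False writes the partition files one at a time from the main
# thread, spreading the PUT requests out instead of firing them concurrently.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",  # placeholder bucket/prefix
    dataset=True,
    partition_cols=["date"],
    use_threads=False,
)
```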
Closing due to inactivity.
Hi all, we have an issue that is rather difficult to reproduce and diagnose.
The question is: When executing awswrangler.s3.to_parquet(dataset=True), what HTTP requests are executed, and is there a way to bring them to light and suppress/throttle some of them?
Setup
awswrangler.s3.to_parquet(dataset=True)
To illustrate, a batch of 2 events should create the following objects (one file per partition):
Problem
About 10-40% of the executions fail with various errors that boil down to "you are trying to write too fast" (stack trace below), somehow exceeding S3's documented quota: "Amazon S3 supports a request rate of 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket".
I've tried to follow the stack trace to understand where exactly we're doing so many reads or writes - and I can't figure it out.
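For what it's worth, one way to bring the individual requests to light (a sketch; "before-send.s3" is a standard botocore event hook, and the DataFrame, bucket/prefix, and partition column are placeholders) is to pass a boto3 session with a request-logging handler registered:

```python
import boto3
import pandas as pd
import awswrangler as wr

def log_s3_request(request, **kwargs):
    # request is the prepared HTTP request botocore is about to send
    print(f"{request.method} {request.url}")

session = boto3.Session()
# "before-send.s3" fires for every S3 API call made through this session
session.events.register("before-send.s3", log_s3_request)

df = pd.DataFrame({"date": ["2022-01-01", "2022-01-02"], "value": [1, 2]})
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",  # placeholder
    dataset=True,
    partition_cols=["date"],
    boto3_session=session,
)
```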
Workaround
We've had some success wrapping the call in tenacity.retry with exponential backoff, but that was a last resort. The whole thing still haunts me and I want to understand why it happens.
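The workaround looks roughly like this (a sketch; retry, wait_exponential, and stop_after_attempt are the standard tenacity helpers, and the wait/stop settings, bucket, and DataFrame are placeholders):

```python
import pandas as pd
import awswrangler as wr
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, max=60), stop=stop_after_attempt(5))
def write_batch(df: pd.DataFrame) -> None:
    # Retry the whole dataset write with exponential backoff. As written,
    # any exception triggers a retry; the real workaround could narrow
    # this to the "write too fast" / throttling errors.
    wr.s3.to_parquet(
        df=df,
        path="s3://my-bucket/my-prefix/",  # placeholder
        dataset=True,
        partition_cols=["date"],
    )

write_batch(pd.DataFrame({"date": ["2022-01-01", "2022-01-02"], "value": [1, 2]}))
```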
Stack trace