aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Race conditions when writing / reading parquet datasets #3021

Open pvieito opened 1 week ago

pvieito commented 1 week ago

Hi! Currently, writing a parquet dataset with mode overwrite / overwrite_partitions creates a race condition between the writer and any reader (aws-wrangler, Spark or Athena, for example), because aws-wrangler first removes the files in each partition and only then creates objects with new random UUID-based names.

This behaviour is quite unsafe: any reader that lists the objects while the overwrite is in progress and then tries to read them will fail with some of these errors (or worse, it will fail silently because it listed the path just after aws-wrangler removed all the files, and sees an empty dataset):
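To make the window concrete, here is a minimal sketch that simulates the delete-then-write sequence. It does not call aws-wrangler or S3: a plain dict stands in for the bucket, and the key names, the `overwrite_delete_first` function and the `reader` hook are all hypothetical names used only for illustration.

```python
import uuid


def list_prefix(store, prefix):
    """Return the sorted keys under a prefix, like a listing of one partition."""
    return sorted(k for k in store if k.startswith(prefix))


def overwrite_delete_first(store, prefix, parts, reader=None):
    """Simulated overwrite: delete every existing object, then write new ones.

    `store` is a dict standing in for a bucket (an assumption for this sketch).
    `reader` is a hypothetical hook called between the two phases.
    """
    for key in list_prefix(store, prefix):
        del store[key]
    if reader is not None:
        reader()  # a concurrent reader listing at this moment
    for data in parts:
        # New objects get fresh random names, so no old key is ever reused.
        store[f"{prefix}{uuid.uuid4().hex}.snappy.parquet"] = data


store = {
    "ds/p=1/old-a.snappy.parquet": b"old-a",
    "ds/p=1/old-b.snappy.parquet": b"old-b",
}
listings = []
overwrite_delete_first(
    store,
    "ds/p=1/",
    [b"new-0", b"new-1"],
    reader=lambda: listings.append(list_prefix(store, "ds/p=1/")),
)
print(listings)  # [[]]
```

The reader that lists between the two phases gets an empty listing, which is the silent failure case. A reader that listed just before the delete phase would instead hold keys that no longer exist when it tries to fetch them.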

We would like a new option to ensure that, in overwrite and overwrite_partitions modes, aws-wrangler does a safe, deterministic and atomic replacement of the destination objects. This could be done using this method:

This would avoid the vast majority of race conditions, since in a typical overwrite the number of parts stays the same or increases.
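For illustration only, here is a sketch of one scheme that is consistent with the remark about the number of parts. It is an assumption, not necessarily the method meant above: objects get deterministic names that are overwritten in place, and leftover objects are deleted only at the end. A dict stands in for the bucket; the `part-00000.parquet` naming, `overwrite_in_place` and the `reader` hook are hypothetical. The sketch relies on a single-object write replacing the previous object as a whole, which holds for the dict here and for S3 PUTs.

```python
def list_keys(store, prefix):
    """Return the sorted keys under a prefix, like a listing of one partition."""
    return sorted(k for k in store if k.startswith(prefix))


def overwrite_in_place(store, prefix, parts, reader=None):
    """Simulated overwrite that reuses deterministic key names.

    `store` is a dict standing in for a bucket (an assumption for this sketch).
    `reader` is a hypothetical hook called after each object is written.
    """
    new_keys = [f"{prefix}part-{i:05d}.parquet" for i in range(len(parts))]
    for key, data in zip(new_keys, parts):
        store[key] = data  # replaces the old object under the same name
        if reader is not None:
            reader()
    keep = set(new_keys)
    # Only now remove objects that the new dataset no longer needs.
    for key in list_keys(store, prefix):
        if key not in keep:
            del store[key]


bucket = {
    "ds/p=1/part-00000.parquet": b"old-0",
    "ds/p=1/part-00001.parquet": b"old-1",
}
counts = []
overwrite_in_place(
    bucket,
    "ds/p=1/",
    [b"new-0", b"new-1", b"new-2"],
    reader=lambda: counts.append(len(list_keys(bucket, "ds/p=1/"))),
)
print(counts)  # [2, 2, 3]
```

In this simulation a listing taken during the overwrite is never empty, and every listed key still exists when it is fetched. A reader can still see a mix of old and new parts, and when the new dataset has fewer parts the trailing deletes reopen a small window, which matches "the vast majority" rather than all race conditions.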

// cc. @jack-dell