Hi! Currently, writing a parquet dataset with mode overwrite / overwrite_partitions creates a race condition between the writer and any reader (aws-wrangler / Spark / Athena, for example), because aws-wrangler first removes the files in each partition and then creates objects with new random UUID-based names.
This behaviour is quite unsafe: any reader that lists the objects while the overwrite is in progress and then tries to read them will fail with errors such as the ones below (or worse, it will fail silently because it listed the path just after aws-wrangler removed all the files, and sees an empty dataset):
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Athena: HIVE_CANNOT_OPEN_SPLIT errors
etc.
We would like a new option to ensure that in overwrite & overwrite_partitions modes aws-wrangler does a safe, deterministic & atomic replacement of the destination objects. This could be done as follows:
Having deterministic output names (for example part-0.parquet, part-1.parquet).
Atomically replacing any existing files in the output path.
Finally, cleaning up any extra files that are not expected in the output path (for example, if this new upload has fewer part files than the previous one).
This would avoid the vast majority of race conditions, since in a typical overwrite the number of parts stays the same or increases.
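To make the proposal concrete, here is a minimal sketch of the three steps in Python. It is not aws-wrangler code: the helper names (part_key, plan_overwrite, safe_overwrite) are hypothetical, and a plain dict stands in for the bucket so the logic can be run without AWS. The sketch assumes that writing to an existing key replaces the object in a single step, so a reader sees either the old or the new version.

```python
from typing import Dict, Iterable, List, Set, Tuple


def part_key(prefix: str, index: int) -> str:
    """Deterministic object name for the index-th output part (hypothetical helper)."""
    return f"{prefix.rstrip('/')}/part-{index}.parquet"


def plan_overwrite(
    existing_keys: Iterable[str], prefix: str, n_parts: int
) -> Tuple[List[str], Set[str]]:
    """Return (keys to write, stale keys to delete afterwards)."""
    to_write = [part_key(prefix, i) for i in range(n_parts)]
    # Anything already under the prefix that the new upload does not produce is stale.
    to_delete = set(existing_keys) - set(to_write)
    return to_write, to_delete


def safe_overwrite(store: Dict[str, bytes], prefix: str, parts: List[bytes]) -> None:
    """Overwrite a partition in place; `store` is a dict standing in for the bucket."""
    base = prefix.rstrip("/") + "/"
    existing = [key for key in store if key.startswith(base)]
    to_write, to_delete = plan_overwrite(existing, prefix, len(parts))

    # Steps 1 and 2: write every part under its deterministic name,
    # replacing any object that already has that name.
    for key, body in zip(to_write, parts):
        store[key] = body

    # Step 3: only after all new parts are in place, remove leftover files.
    for key in to_delete:
        del store[key]
```

With this ordering the partition is never empty during the overwrite. A reader can still hit a missing key if it lists the path just before the clean-up removes an extra part, which is the remaining case when the number of parts shrinks.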
// cc. @jack-dell