jaidisido closed this issue 4 years ago
Thanks @jaidisido! We will try to overcome it!
Fix done on the development branch, available through:
pip install git+https://github.com/awslabs/aws-data-wrangler.git@dev
It will be officially available in the next version (1.2.0). Thanks!
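For anyone who wants to confirm the development build is picked up before re-running the failing call, a minimal check (assuming the package is importable as awswrangler and exposes a version string) is:

import awswrangler as wr

# The build installed from the @dev branch should report a pre-release version
# here; 1.2.0 and later include the official fix.
print(wr.__version__)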
Released in version 1.2.0.
Many thanks @igorborgest, I will test it on my end and revert back!
Describe the bug
I have two parquet files in S3 under a given prefix.
When running this command in Lambda:
columns_types, partitions_types = wr.s3.read_parquet_metadata(path='s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/', dataset=True)
I receive the following pyarrow schema validation error: TraceLog.txt
However, when running the command on individual files:
columns_types, partitions_types = wr.s3.read_parquet_metadata(path='s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/part-00019-6bbd0763-cb6d-4ffe-8dca-2df6670bdeaa.c000.snappy.parquet')
print(columns_types)
{'Op': 'string', 'last_modified_at': 'timestamp', 'orderrate_id': 'bigint', 'marketplace_id': 'bigint', 'start': 'timestamp', 'order_count': 'int', 'smoothed': 'double', 'time_Since_Last_Order': 'int'}
and
columns_types, partitions_types = wr.s3.read_parquet_metadata(path='s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/part-00019-e18e0c01-1997-482b-ba1d-3176098e99c3.c000.snappy.parquet')
print(columns_types)
{'last_modified_at': 'timestamp', 'orderrate_id': 'bigint', 'marketplace_id': 'bigint', 'start': 'timestamp', 'order_count': 'int', 'smoothed': 'double', 'time_Since_Last_Order': 'int', 'Op': 'string'}
One can see that the schemas are identical, but the order of the columns differs between the two dictionaries.
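A quick check with the two dictionaries printed above (a minimal sketch, just to confirm that only the ordering differs):

schema_a = {'Op': 'string', 'last_modified_at': 'timestamp', 'orderrate_id': 'bigint',
            'marketplace_id': 'bigint', 'start': 'timestamp', 'order_count': 'int',
            'smoothed': 'double', 'time_Since_Last_Order': 'int'}
schema_b = {'last_modified_at': 'timestamp', 'orderrate_id': 'bigint', 'marketplace_id': 'bigint',
            'start': 'timestamp', 'order_count': 'int', 'smoothed': 'double',
            'time_Since_Last_Order': 'int', 'Op': 'string'}

# Dict equality in Python ignores insertion order, so the column/type mappings
# are the same; only the ordering that pyarrow's schema validation compares differs.
assert schema_a == schema_b
assert sorted(schema_a.items()) == sorted(schema_b.items())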
Sorting the dictionaries by their keys would work around the issue, but based on the error log:
File "/opt/python/pyarrow/parquet.py", line 1113, in validate_schemas dataset_schema))
the schema validation is done inside pyarrow, which I doubt can be changed.
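Until the fix lands, one possible workaround sketch (not the library's own fix; it assumes wr.s3.list_objects is available and that all files share the same column types) is to read the metadata file by file and merge the resulting dictionaries, so the differing column order never reaches pyarrow's schema validation:

import awswrangler as wr

prefix = 's3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/'

merged_columns = {}
for file_path in wr.s3.list_objects(prefix):
    if not file_path.endswith('.parquet'):
        continue  # skip non-parquet objects under the prefix
    columns_types, _ = wr.s3.read_parquet_metadata(path=file_path)
    merged_columns.update(columns_types)

# Note: unlike dataset=True, this does not collect partition columns
# (e.g. the "duration" partition key from the prefix).
print(merged_columns)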