aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

read_parquet_metadata incorrectly throws validation_schema error #195

Closed jaidisido closed 4 years ago

jaidisido commented 4 years ago

Describe the bug

I have two parquet files in S3 under a given prefix.

When running this command in Lambda:

```python
columns_types, partitions_types = wr.s3.read_parquet_metadata(
    path='s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/',
    dataset=True,
)
```

I receive the following pyarrow schema validation error (attached as TraceLog.txt).

However, when running the command on the individual files:

```python
columns_types, partitions_types = wr.s3.read_parquet_metadata(
    path='s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/part-00019-6bbd0763-cb6d-4ffe-8dca-2df6670bdeaa.c000.snappy.parquet'
)
print(columns_types)
```

```
{'Op': 'string', 'last_modified_at': 'timestamp', 'orderrate_id': 'bigint', 'marketplace_id': 'bigint', 'start': 'timestamp', 'order_count': 'int', 'smoothed': 'double', 'time_Since_Last_Order': 'int'}
```

and

```python
columns_types, partitions_types = wr.s3.read_parquet_metadata(
    path='s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/part-00019-e18e0c01-1997-482b-ba1d-3176098e99c3.c000.snappy.parquet'
)
print(columns_types)
```

```
{'last_modified_at': 'timestamp', 'orderrate_id': 'bigint', 'marketplace_id': 'bigint', 'start': 'timestamp', 'order_count': 'int', 'smoothed': 'double', 'time_Since_Last_Order': 'int', 'Op': 'string'}
```

One can see that the schemas are identical but the order of the columns is different in the two dictionaries.
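For illustration, the mismatch seems to come from the fact that Python dict comparison ignores key order while pyarrow compares schema fields positionally, so dataset-level validation can fail even though the per-file column types agree. A minimal sketch of that difference, using two of the columns from the files above:

```python
import pyarrow as pa

# The same two fields, declared in a different order, as in the two parquet files above.
schema_a = pa.schema([("Op", pa.string()), ("order_count", pa.int32())])
schema_b = pa.schema([("order_count", pa.int32()), ("Op", pa.string())])

# Dict comparison ignores ordering, so the column-type mappings look identical...
print(dict(zip(schema_a.names, schema_a.types)) == dict(zip(schema_b.names, schema_b.types)))  # True

# ...but pyarrow schema equality is positional, so dataset-level validation fails.
print(schema_a.equals(schema_b))  # False
```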

Sorting the dictionary by its keys would solve the issue, but based on the error log the schema validation happens inside pyarrow, which I doubt can be changed:

```
File "/opt/python/pyarrow/parquet.py", line 1113, in validate_schemas
    dataset_schema))
```
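As a possible interim workaround, assuming per-file metadata reads keep working as shown above, one could list the objects under the prefix and merge the per-file column-type dictionaries manually (order-insensitive) instead of relying on pyarrow's dataset-level validation. A rough sketch, not the library's own logic:

```python
import awswrangler as wr

prefix = "s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/"

merged_columns_types = {}
for obj_path in wr.s3.list_objects(prefix):
    if not obj_path.endswith(".parquet"):
        continue
    columns_types, _ = wr.s3.read_parquet_metadata(path=obj_path)
    for column, dtype in columns_types.items():
        # Only complain when two files disagree on a column's type, not on its position.
        if merged_columns_types.setdefault(column, dtype) != dtype:
            raise ValueError(f"Type mismatch for column {column!r}")

print(merged_columns_types)
```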

igorborgest commented 4 years ago

Thanks @jaidisido! We will try to overcome it!

igorborgest commented 4 years ago

Fix done on the development branch, available through:

```
pip install git+https://github.com/awslabs/aws-data-wrangler.git@dev
```

It will be officially available in the next version, 1.2.0. Thanks!
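For anyone following along, a quick way to verify the fix after installing the dev branch (or 1.2.0 once released) is simply to re-run the dataset-level call from the original report; this is just the reproduction from above, not a new API:

```python
import awswrangler as wr

columns_types, partitions_types = wr.s3.read_parquet_metadata(
    path="s3://bucket/post-stage/datafabric/orderrates/ods_prod/order_rates/duration=1/",
    dataset=True,
)
print(columns_types)
print(partitions_types)
```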

igorborgest commented 4 years ago

Released in version 1.2.0.

jaidisido commented 4 years ago

Many thanks @igorborgest, I will test it on my end and report back!