Querying a newly created data product fails with error "HIVE_INVALID_METADATA: Hive metadata for table xxx is invalid: Table descriptor contains duplicate columns"

SYMPTOM

When querying a data product that was created, or had data ingested, around or after April 2023, the user receives the error message below and the query fails.

HIVE_INVALID_METADATA: Hive metadata for table xxx is invalid: Table descriptor contains duplicate columns

This happens on both existing and new deployments of Release v1.1.0 and earlier.

CAUSE

Around April 2023, AWS Glue Crawler changed its behaviour: the crawler now automatically creates a Partition Index after it crawls the data and creates a table in the Glue Data Catalog. As a result, the step in this solution that is supposed to update the partition fields after the crawler creates the table fails. This leaves the table metadata invalid, so Athena cannot query the table.
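To check whether a given table is affected, its definition can be inspected for repeated column names. The following is a minimal sketch, not the solution's own code. It assumes the duplicates occur between the table's data columns and its partition keys, which is the usual reading of this Athena error but is not confirmed above. The function name find_duplicate_columns and the example table are hypothetical; the dictionary layout follows the Table structure returned by the Glue GetTable API.

```python
from collections import Counter
from typing import Any


def find_duplicate_columns(table: dict[str, Any]) -> list[str]:
    """Return column names that occur more than once across the
    data columns and the partition keys of a Glue table definition."""
    data_columns = table.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table.get("PartitionKeys", [])
    # Compare names case-insensitively (an assumption of this sketch).
    names = [col["Name"].lower() for col in data_columns + partition_keys]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical table definition, shaped like the "Table" entry of a
# Glue GetTable response. "year" appears both as a data column and as
# a partition key.
example_table = {
    "Name": "xxx",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "string"},
            {"Name": "year", "Type": "string"},
        ]
    },
    "PartitionKeys": [{"Name": "year", "Type": "string"}],
}

print(find_duplicate_columns(example_table))
```

For a real table, the dictionary could be obtained with boto3, for example glue.get_table(DatabaseName=..., Name=...)["Table"], and passed to the same function. A non-empty result would suggest the table is in the invalid state described above.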

SOLUTION

This issue has been resolved in Release v1.2.0. Upgrading an existing deployment to Release v1.2.0 resolves the issue for newly created data products. Existing data products might need to be removed and their data re-imported after upgrading to Release v1.2.0.

aws-solutions / automated-data-analytics-on-aws