aws-solutions / automated-data-analytics-on-aws

The Automated Data Analytics on AWS solution provides an end-to-end data platform for ingesting, transforming, managing and querying datasets. This helps analysts and business users manage and gain insights from data without deep technical experience using Amazon Web Services (AWS).
Apache License 2.0

Querying a newly created data product fails with error "HIVE_INVALID_METADATA: Hive metadata for table xxx is invalid: Table descriptor contains duplicate columns" #41

Closed hu-jin-aws closed 1 year ago

hu-jin-aws commented 1 year ago

SYMPTOM

When querying a data product that was created around or after April 2023, or that ingested data in that time frame, the user receives the error message below and the query fails.

HIVE_INVALID_METADATA: Hive metadata for table xxx is invalid: Table descriptor contains duplicate columns

This affects both existing and new deployments of release 1.1.0 and earlier.

CAUSE

Around April 2023, AWS Glue Crawler changed its behaviour: after crawling the data and creating a table in the Glue Data Catalog, the crawler now automatically creates a partition index. This caused a failure in the step of this solution that is supposed to update the partition fields after the crawler creates the table. As a result, the table was left in a state that is invalid for Athena to query.
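To check whether a particular table is in this state, the table definition can be inspected for repeated column names. The sketch below is not taken from the solution's code. It assumes the duplicate shows up as the same name appearing more than once across the table's regular columns and its partition keys, which is a common way for Athena to report "duplicate columns". The function name and the example table are hypothetical; the dictionary layout follows the `Table` object returned by the Glue `GetTable` API.

```python
from collections import Counter


def find_duplicate_columns(table: dict) -> list[str]:
    """Return column names that occur more than once in a Glue table definition.

    `table` is assumed to have the shape of the `Table` object returned by
    the Glue GetTable API: regular columns under StorageDescriptor.Columns
    and partition columns under PartitionKeys.
    """
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table.get("PartitionKeys", [])
    # Assumption: names are compared case-insensitively.
    names = [col["Name"].lower() for col in columns + partition_keys]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical table definition, for illustration only: "ingest_date" is
# listed both as a regular column and as a partition key.
example_table = {
    "Name": "example_table",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "string"},
            {"Name": "ingest_date", "Type": "string"},
        ]
    },
    "PartitionKeys": [{"Name": "ingest_date", "Type": "string"}],
}

print(find_duplicate_columns(example_table))
```

An empty result means no name is repeated, so the error would have to come from something other than the case assumed here.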

SOLUTION

This issue has been resolved in Release v1.2.0. Upgrading an existing deployment to Release v1.2.0 fixes the issue for newly created data products. Existing data products might need to be removed and their data re-imported after upgrading to Release v1.2.0.
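To decide which existing data products might need to be removed and re-imported, one option is to scan a Glue database for tables whose definitions contain repeated column names. This is a rough sketch under assumptions, not part of the solution: the function name is hypothetical, the duplicate is assumed to appear across `StorageDescriptor.Columns` and `PartitionKeys`, and the database that holds the data product tables has to be looked up in your own deployment. The client is passed in as an argument and is expected to behave like a boto3 Glue client.

```python
from collections import Counter


def tables_with_duplicate_columns(glue_client, database_name: str) -> dict:
    """Map table name -> duplicated column names for one Glue database.

    `glue_client` is expected to behave like a boto3 Glue client; only the
    "get_tables" paginator is used.
    """
    affected = {}
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page.get("TableList", []):
            cols = table.get("StorageDescriptor", {}).get("Columns", [])
            keys = table.get("PartitionKeys", [])
            # Assumption: names are compared case-insensitively.
            counts = Counter(c["Name"].lower() for c in cols + keys)
            dupes = sorted(name for name, n in counts.items() if n > 1)
            if dupes:
                affected[table["Name"]] = dupes
    return affected
```

With boto3 installed and AWS credentials configured, this could be called as `tables_with_duplicate_columns(boto3.client("glue"), database_name)`. The scan is read-only; it reports candidate tables and does not modify the Data Catalog.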

hu-jin-aws commented 1 year ago

Fixed in release 1.2.0