@aniketnanna These are all AWS managed services involved. Have you filed an AWS support case?
@xushiyan We have three issues to solve, as mentioned above:
1. Athena/Glue partition sync:
a. This issue is related to Athena. We have connected with AWS Support about it.
b. The support engineer found an error, but only for a few records and a few tables: "can not create year partitions from string".
c. Added the following parameter from the Hudi documentation to the Glue job (a sketch of the writer settings follows this list):
--sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
(reference: https://hudi.apache.org/docs/syncing_aws_glue_data_catalog/)
d. Current status: even with the parameter from (c), the Athena/Glue partitions are not being added to the table, although the data is written to S3 under the respective partition paths.
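For reference, here is a minimal sketch of how the equivalent settings look on the PySpark writer. Table, bucket, and field names are hypothetical, and the two meta-sync options are our reading of the docs page linked above, so please verify them for 0.10.1:

```python
# Minimal sketch of a Hudi write from a Glue job that syncs partitions to the
# Glue Data Catalog. Table, bucket, and field names are hypothetical.
# df is the DataFrame produced earlier in the job.
hudi_options = {
    "hoodie.table.name": "table_name",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    # Meta sync to the catalog, routed through the Glue sync tool
    # (assumed datasource-writer equivalent of --sync-tool-classes).
    "hoodie.datasource.meta.sync.enable": "true",
    "hoodie.meta.sync.client.tool.class": "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
    "hoodie.datasource.hive_sync.database": "db",
    "hoodie.datasource.hive_sync.table": "table_name",
    "hoodie.datasource.hive_sync.partition_fields": "year,month,day",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/processed/table_name/"))
```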
2. DDL changes:
a. The Hudi 0.11.0 documentation explicitly states that DDL such as ALTER TABLE can be run via Spark SQL starting with Hudi 0.11.0.
b. Though DDL changes can be handled from PySpark code, the Hudi 0.10.1 documentation does not explicitly state that DDL such as ALTER TABLE can be run via a Spark SQL query. It does, however, provide a 'How To' page for running ALTER TABLE with Spark SQL queries.
c. So it is unclear whether Hudi 0.10.1 can perform ALTER TABLE DDL queries with Spark SQL.
d. It threw an error when I tried adding a column from Spark SQL. The error screenshot is attached in the support case above.
e. We need your guidance to perform mainly two DDL changes: add column and drop column (a sketch of the DDL follows this list).
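For concreteness, this is the shape of the DDL we are trying to run. Database, table, and column names are examples, and the schema-on-read flag for DROP COLUMN is our reading of the Hudi schema evolution docs, which may only apply from 0.11.0:

```python
# Example DDL via Spark SQL; db/table/column names are hypothetical.
# ADD COLUMNS is documented for Hudi 0.11.0 Spark SQL. DROP COLUMN belongs to
# schema-on-read evolution and, as we understand the docs, needs the flag
# below (an assumption to verify for your version).
spark.sql("set hoodie.schema.on.read.enable=true")

# Add a column.
spark.sql("ALTER TABLE db.table_name ADD COLUMNS (check_status string)")

# Drop a column (schema-on-read evolution).
spark.sql("ALTER TABLE db.table_name DROP COLUMN check_status")
```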
3. Upgrade to a newer version:
a. If we upgrade to a newer version, it is not feasible for us to reprocess all the data.
b. If a version upgrade can solve some of these issues and AWS is compatible with Hudi versions above 0.10.1, we need your help upgrading without affecting the existing Hudi table data (a sketch follows).
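As far as we can tell from the Hudi docs (an assumption, please correct us), the table's internal version is upgraded automatically on the first write made with a newer release, without rewriting the existing data files, so swapping the job's Hudi bundle jar may be enough. A sketch with hypothetical names:

```python
# Sketch: after pointing the Glue job at a newer Hudi bundle jar, the first
# write with the newer writer is expected to upgrade the table version in
# place; existing data files are not reprocessed. (Our assumption from the
# Hudi docs.) df and hudi_options are whatever the job already uses.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/processed/table_name/"))
```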
a. Regarding missing data in some of the tables, where data is present in the processed bucket but Athena is not able to read it:
b. Found the following errors in the Glue job error logs:
c. The above issue was solved by adding the missing partitions using ALTER TABLE ADD PARTITION in Athena (an example follows this list).
d. We are not able to find the root cause of the issue where the Glue and Hudi scripts randomly missed a few partitions.
e. The exact same setup is used for other tables, where it worked fine and added all partitions; only this table shows the anomaly.
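For reference, the manual repair in (c) looked roughly like this. Database, table, partition values, and S3 locations are examples; we ran the statement in the Athena console, shown here via boto3 for completeness:

```python
# Sketch of the manual fix from (c): register a missing partition in Athena.
# Database, table, partition values, and S3 locations are examples.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE table_name ADD IF NOT EXISTS "
        "PARTITION (year='2023', month='01', day='15') "
        "LOCATION 's3://bucket/processed/table_name/year=2023/month=01/day=15/'"
    ),
    QueryExecutionContext={"Database": "db"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)
```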
4. Unpartitioned data records:
a. A few records from a table are unpartitioned and are stored in S3 under the partition name "default", i.e. year/month/day --> default/default/default.
b. These records are not reflected in Athena but can be queried from Spark SQL.
Please help with including the unpartitioned data in Athena query results. What configuration or approach would help here?
> a. A few records from a table are unpartitioned and are stored in S3 under the partition name "default", i.e. year/month/day --> default/default/default. b. These records are not reflected in Athena but can be queried from Spark SQL.
Hudi used to write a `default` partition value when the partition-path field was null for a record in a partitioned table. This value is not compatible with Hive-based engines, so we switched to the Hive sentinel value `__HIVE_DEFAULT_PARTITION__` in https://github.com/apache/hudi/pull/5954.
There is a check during Hudi upgrade that will fail if a partition with the old `default` value is present in the table. When the upgrade fails, hudi-cli can be used to repair the table.
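To illustrate, a minimal sketch with hypothetical names: a record whose partition-path field is null now lands under the sentinel directory, which Hive-based engines such as Athena understand.

```python
# Minimal sketch (hypothetical table and paths): a record with a null
# partition field lands under __HIVE_DEFAULT_PARTITION__ instead of "default".
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 0, None)], "id int, ts int, year string")
(df.write.format("hudi")
    .option("hoodie.table.name", "demo")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "year")
    .mode("append")
    .save("s3://bucket/demo/"))
# Resulting layout: s3://bucket/demo/__HIVE_DEFAULT_PARTITION__/<data files>
```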
@aniketnanna After the above fix, it creates the partition as `__HIVE_DEFAULT_PARTITION__`, and we confirmed that Athena is not missing any data.
Glue code here: https://gist.github.com/ad1happy2go/7d982bc6e137b56ce6e6f18bdb62fd03
Closing the issue. Please try it out as suggested above.
Highlights of Issues We Are Facing:
Detailed Description of Issues:
1. Missing Data:
a. For around 20 tables, a few records are randomly missing in comparison to the main AWS RDS DB: 100-200 records out of millions are not available from Athena, while Spark SQL shows the correct counts (a rough validation sketch follows this list).
b. For one or two tables, only a single record was missed out of 170 million records.
c. In one table, 2,800 records were missed out of 600,000 records.
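(A rough sketch of how we detect the missing records against the RDS source; connection details, table names, and the key column are examples.)

```python
# Rough sketch (names and credentials are examples): find RDS rows that are
# absent from the Hudi table by anti-joining on the primary key.
# spark is the job's existing SparkSession.
hudi_df = spark.read.format("hudi").load("s3://bucket/processed/table_name/")
rds_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/dbname")
    .option("dbtable", "public.table_name")
    .option("user", "user")
    .option("password", "password")
    .load())

# Rows present in RDS but not in the Hudi table.
missing = rds_df.join(hudi_df, on="id", how="left_anti")
print(missing.count())
```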
2. DDL Changes:
a. Add a column to an existing table and drop a column from an existing table.
b. Change a table/column name.
c. Not able to add or delete a column via spark.sql with ALTER TABLE ADD COLUMNS, e.g. spark.sql('alter table db.table_name add columns(check_status string)').
3. Upgrade to a newer version:
a. Upgrade to a newer version of Hudi in the AWS work environment (current Hudi version: 0.10.1) without reprocessing the complete data.
Work Requirement:
Details of Work Environment:
Migrating Postgres RDS to an S3 data lake using AWS DMS:
Postgres Version: 12.8
Development Environment:
Script-1 Details:
Script-2 Details: