This looks like a valid issue. PartitionsToDelete has a hard limit of 25: https://docs.aws.amazon.com/glue/latest/webapi/API_BatchDeletePartition.html
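One way to stay under that limit, and presumably the shape of the eventual fix, is to chunk the deletions into batches of at most 25. A minimal sketch, assuming the AWS SDK v2 Glue client; the class and method names here are illustrative, not Hudi's actual sync-client code:

```java
import java.util.List;

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.BatchDeletePartitionRequest;
import software.amazon.awssdk.services.glue.model.PartitionValueList;

public class BatchedGluePartitionDeleter {
    // AWS Glue rejects BatchDeletePartition requests whose PartitionsToDelete
    // list has more than 25 entries, per the API documentation linked above.
    private static final int MAX_PARTITIONS_PER_REQUEST = 25;

    private final GlueClient glue;

    public BatchedGluePartitionDeleter(GlueClient glue) {
        this.glue = glue;
    }

    // Splits the partition list into chunks of at most 25 and issues one
    // BatchDeletePartition call per chunk.
    public void deletePartitions(String database, String table, List<PartitionValueList> partitions) {
        for (int start = 0; start < partitions.size(); start += MAX_PARTITIONS_PER_REQUEST) {
            int end = Math.min(start + MAX_PARTITIONS_PER_REQUEST, partitions.size());
            glue.batchDeletePartition(BatchDeletePartitionRequest.builder()
                .databaseName(database)
                .tableName(table)
                .partitionsToDelete(partitions.subList(start, end))
                .build());
        }
    }
}
```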
This is a duplicate. Another GitHub issue reports the same problem: https://github.com/apache/hudi/issues/9805
Noted, thanks @CTTY for having a look.
@CTTY
This issue is focused on BatchDeletePartition not supporting 25+ partitions.
My issue #9805 is focused on the DELETE_PARTITION Hudi operation creating a .replacecommit that is then used as the source of truth for all future Glue syncs, which is logically incorrect.
@CTTY As @noahtaite mentioned above, although the two issues look similar, they focus on different aspects and could potentially have separate solutions.
@buiducsinh34 @noahtaite Created a JIRA and a PR to fix the batch size:
JIRA - https://issues.apache.org/jira/browse/HUDI-6932
PR - https://github.com/apache/hudi/pull/9842
@CTTY Can you please review the PR? Thanks.
@buiducsinh34 @noahtaite Closing this out as the PR is merged. Thanks, everybody. Feel free to reopen if you still see the issue.
Describe the problem you faced
AWS Glue sync fails when an overwriting action is performed on a Hudi table with more than 25 partitions. AWS Glue enforces a constraint on the BatchDeletePartition request: "PartitionsToDelete" may contain at most 25 partitions. Reference: https://docs.aws.amazon.com/glue/latest/webapi/API_BatchDeletePartition.html#Glue-BatchDeletePartition-request-PartitionsToDelete
To Reproduce
Steps to reproduce the behavior:
1. Create a Hudi table with more than 25 partitions and Glue sync enabled.
2. Run an overwriting operation (e.g. insert_overwrite_table) that replaces more than 25 partitions; see the sketch after this list.
3. The subsequent Glue sync fails with the error in the stacktrace below.
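For illustration, a minimal Spark sketch of such a run, assuming the Hudi 0.13.x Spark datasource on an EMR cluster that uses the Glue Data Catalog as its Hive metastore; the table name, S3 path, and partition column are hypothetical:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.pmod;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class GlueSyncRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-glue-sync-repro")
            .getOrCreate();

        // 30 distinct partition values, i.e. more than Glue's 25-partition
        // cap for a single BatchDeletePartition request.
        Dataset<Row> df = spark.range(0, 1000)
            .withColumn("part", pmod(col("id"), lit(30)));

        df.write().format("hudi")
            .option("hoodie.table.name", "glue_sync_repro")            // hypothetical table
            .option("hoodie.datasource.write.recordkey.field", "id")
            .option("hoodie.datasource.write.partitionpath.field", "part")
            // Overwriting the whole table makes the sync drop the replaced
            // partitions from the catalog, triggering BatchDeletePartition.
            .option("hoodie.datasource.write.operation", "insert_overwrite_table")
            .option("hoodie.datasource.hive_sync.enable", "true")
            .option("hoodie.datasource.hive_sync.database", "default")
            .option("hoodie.datasource.hive_sync.table", "glue_sync_repro")
            .mode(SaveMode.Append)  // table assumed to exist from an earlier write
            .save("s3://my-bucket/hudi/glue_sync_repro");              // hypothetical path
    }
}
```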
Expected behavior
The Glue sync should succeed. Instead, AWS Glue sync fails with the error message:
Environment Description
Hudi version : 0.13.1
Spark version : 3.4.0
Hive version : 3.1.3
Hadoop version : 3.3.3
AWS EMR version: 6.12.0
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
N/A
Stacktrace