apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] AWS Glue Sync fails on a Hudi table with > 25 partitions #9806

Closed: buiducsinh34 closed this issue 1 year ago

buiducsinh34 commented 1 year ago

Describe the problem you faced

AWS Glue sync fails when an overwriting action is performed on a Hudi table with more than 25 partitions. AWS Glue enforces a constraint on the BatchDeletePartition request: the PartitionsToDelete list may contain no more than 25 entries. Reference: https://docs.aws.amazon.com/glue/latest/webapi/API_BatchDeletePartition.html#Glue-BatchDeletePartition-request-PartitionsToDelete
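
To illustrate the constraint, here is a minimal sketch of the failing call shape against the AWS SDK v1 Glue client. Hudi bundles a shaded copy of this SDK under org.apache.hudi.com.amazonaws; the unshaded imports are used here, and the database/table/partition names are the placeholders from this report:

```java
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.BatchDeletePartitionRequest;
import com.amazonaws.services.glue.model.PartitionValueList;

public class GlueLimitDemo {
  public static void main(String[] args) {
    AWSGlue glue = AWSGlueClientBuilder.defaultClient();

    // Build 100 partition value lists, mirroring the report.
    List<PartitionValueList> toDelete = new ArrayList<>();
    for (int i = 1; i <= 100; i++) {
      toDelete.add(new PartitionValueList().withValues("partition" + i));
    }

    // Sending all 100 in one request violates the 25-entry limit on
    // PartitionsToDelete, so Glue returns a 400 ValidationException.
    glue.batchDeletePartition(new BatchDeletePartitionRequest()
        .withDatabaseName("example_glue_database")
        .withTableName("example_glue_table")
        .withPartitionsToDelete(toDelete));
  }
}
```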

To Reproduce

Steps to reproduce the behavior:

  1. Generate a Hudi table with 100 partitions via bulk insert, with AWS Glue sync enabled. A Glue table named "example_glue_table" is created.
  2. Re-generate the table via bulk insert with updated data, again with Glue sync enabled (a sketch of such a write follows this list).
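
For reference, a minimal sketch of such a write via the Spark DataFrame API in Java; the input path, record key (id), partition column (partition_col), and precombine field (ts) are hypothetical, and the sync config keys should be verified against your Hudi version:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ReproGlueSync {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-glue-sync-repro").getOrCreate();

    // Hypothetical input with ~100 distinct values in the partition column.
    Dataset<Row> df = spark.read().parquet("s3://example-bucket/input/");

    df.write().format("hudi")
        .option("hoodie.table.name", "example_glue_table")
        .option("hoodie.datasource.write.operation", "bulk_insert")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.partitionpath.field", "partition_col")
        .option("hoodie.datasource.write.precombine.field", "ts")
        // Enable meta sync through the AWS Glue catalog sync tool.
        .option("hoodie.datasource.meta.sync.enable", "true")
        .option("hoodie.meta.sync.client.tool.class",
                "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool")
        .option("hoodie.datasource.hive_sync.database", "example_glue_database")
        .option("hoodie.datasource.hive_sync.table", "example_glue_table")
        // Step 2: overwriting the table triggers a partition drop during sync.
        .mode(SaveMode.Overwrite)
        .save("s3://example-bucket/hudi/example_glue_table");
  }
}
```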

Expected behavior

The Glue sync should complete successfully. Instead, it fails with the error message:

org.apache.hudi.com.amazonaws.services.glue.model.ValidationException: 1 validation error detected: Value '[PartitionValueList(values=[partition1]), PartitionValueList(values=[partition2]), PartitionValueList(values=[partition3]), ...(96 more) PartitionValueList(values=[partition100])]' at 'partitionsToDelete' failed to satisfy constraint: Member must have length less than or equal to 25 (Service: AWSGlue; Status Code: 400; Error Code: ValidationException; Request ID: ...; Proxy: null)

Environment Description

Additional context

N/A

Stacktrace

23/09/29 05:02:11 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
    at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:888)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:886)
    at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:826)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:322)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    ...
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:760)
Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing example_glue_table
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:165)
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:59)
    ... 56 more
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table example_glue_table
    at org.apache.hudi.hive.HiveSyncTool.syncAllPartitions(HiveSyncTool.java:403)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:272)
    at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:174)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:162)
    ... 57 more
Caused by: org.apache.hudi.aws.sync.HoodieGlueSyncException: Fail to drop partitions to example_glue_database.example_glue_table
    at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.dropPartitions(AWSGlueCatalogSyncClient.java:222)
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:457)
    at org.apache.hudi.hive.HiveSyncTool.syncAllPartitions(HiveSyncTool.java:399)
    ... 60 more
Caused by: org.apache.hudi.com.amazonaws.services.glue.model.ValidationException: 1 validation error detected: Value '[PartitionValueList(values=[partition1]), PartitionValueList(values=[partition2]), PartitionValueList(values=[partition3]), ...(96 more) PartitionValueList(values=[partition100])]' at 'partitionsToDelete' failed to satisfy constraint: Member must have length less than or equal to 25 (Service: AWSGlue; Status Code: 400; Error Code: ValidationException; Request ID: ...; Proxy: null)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at org.apache.hudi.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:13784)
    at org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:13751)
    at org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:13740)
    at org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.executeBatchDeletePartition(AWSGlueClient.java:406)
    at org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.batchDeletePartition(AWSGlueClient.java:375)
    at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.dropPartitions(AWSGlueCatalogSyncClient.java:214)
    ... 62 more
)
CTTY commented 1 year ago

This looks like a valid issue, PartitionsToDelete has a hard limit of 25: https://docs.aws.amazon.com/glue/latest/webapi/API_BatchDeletePartition.html

CTTY commented 1 year ago

This is a duplicate; another GH issue reports the same problem: https://github.com/apache/hudi/issues/9805

buiducsinh34 commented 1 year ago

Noted, thanks @CTTY for having a look.

noahtaite commented 1 year ago

@CTTY

This issue is focused on BatchDeletePartition not supporting more than 25 partitions.

My issue #9805 is focused on the DELETE_PARTITION Hudi operation creating a .replacecommit that is then used as the source of truth for all future Glue syncs, which is logically incorrect.

buiducsinh34 commented 1 year ago

@CTTY As @noahtaite mentioned above, although the two issues look similar, they focus on different aspects and could have separate solutions.

ad1happy2go commented 1 year ago

@buiducsinh34 @noahtaite Created a JIRA and a PR to fix the batch size:

JIRA - https://issues.apache.org/jira/browse/HUDI-6932

PR - https://github.com/apache/hudi/pull/9842
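
For context, a minimal sketch of what capping the batch size can look like, again with the unshaded AWS SDK v1 Glue client; this illustrates the idea, not the exact change in the PR:

```java
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.model.BatchDeletePartitionRequest;
import com.amazonaws.services.glue.model.PartitionValueList;

public class ChunkedPartitionDrop {
  // AWS Glue rejects BatchDeletePartition requests with more than 25 entries.
  private static final int MAX_PARTITIONS_PER_REQUEST = 25;

  /** Drops partitions in chunks of at most 25, staying under the Glue API limit. */
  public static void dropPartitions(AWSGlue glue, String database, String table,
                                    List<PartitionValueList> partitions) {
    for (int start = 0; start < partitions.size(); start += MAX_PARTITIONS_PER_REQUEST) {
      int end = Math.min(start + MAX_PARTITIONS_PER_REQUEST, partitions.size());
      List<PartitionValueList> chunk = new ArrayList<>(partitions.subList(start, end));
      glue.batchDeletePartition(new BatchDeletePartitionRequest()
          .withDatabaseName(database)
          .withTableName(table)
          .withPartitionsToDelete(chunk));
    }
  }
}
```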

@CTTY Can you please review the PR? Thanks.

ad1happy2go commented 1 year ago

@buiducsinh34 @noahtaite Closing this out as the PR is merged. Thanks, everybody. Feel free to reopen if you still see the issue.