apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/

[SUPPORT] Failed to rollback error in Hudi MOR table. #8677

Open · PhantomHunt opened this issue 1 year ago

PhantomHunt commented 1 year ago

Hi Hudi team, today we changed the configuration of our Hudi merge-on-read (MOR) table (0.13.0) in our hudi_writer job.

Before:

hudi_config = {
    'fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
    'hoodie.datasource.write.operation': 'upsert',  
    'hoodie.datasource.write.precombine.field': 'cdc_timestamp',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.schema.on.read.enable': "true",
    'hoodie.datasource.write.reconcile.schema': "true",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': id,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.keep.max.commits': 50,
    'hoodie.keep.min.commits': 40,
    'hoodie.cleaner.commits.retained': 30
}
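For context, either version of this dict is passed to the Hudi Spark datasource in the usual way. A minimal sketch follows; df (the input DataFrame), base_path (the S3 table path), and the Spark session setup are placeholders, not details from this issue:

# Minimal sketch: applying a config dict like the one above in a
# PySpark Hudi writer. `df` and `base_path` are placeholders.
df.write.format("hudi") \
    .options(**hudi_config) \
    .mode("append") \
    .save(base_path)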

After:

hudi_config = {
    'fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'cdc_timestamp',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.schema.on.read.enable': "true",
    'hoodie.datasource.write.reconcile.schema': "true",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': id,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.keep.max.commits': 20,
    'hoodie.keep.min.commits': 11,
    'hoodie.cleaner.commits.retained': 10,
    'hoodie.compact.inline.max.delta.commits': 10,
    'hoodie.log.compaction.inline': "true",
    'hoodie.log.compaction.blocks.threshold': 50
}
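For readers comparing the two blocks, these are the settings that changed. The comments annotate what each knob controls per the Hudi 0.13 documentation; they are not text from the reporter:

changed_settings = {
    # Archival: keep between 11 and 20 completed instants on the
    # active timeline before archiving older ones.
    'hoodie.keep.max.commits': 20,
    'hoodie.keep.min.commits': 11,
    # Cleaning: retain file versions for the last 10 commits.
    'hoodie.cleaner.commits.retained': 10,
    # Schedule compaction after every 10 delta commits.
    'hoodie.compact.inline.max.delta.commits': 10,
    # Enable inline log compaction (merges small log blocks without
    # rewriting base files); schedule it once a file slice has
    # at least 50 log blocks.
    'hoodie.log.compaction.inline': "true",
    'hoodie.log.compaction.blocks.threshold': 50,
}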

After this change, writes started failing with the following error:

Bad request response due to client view behind server view. Last known instant from client was 20230508085556469 but server has the following timeline [[20230504231649917__clean__COMPLETED], [20230504233121181__clean__COMPLETED], [20230505060150491__clean__COMPLETED], [20230505125204529__clean__COMPLETED], [20230505125658677__clean__COMPLETED], [20230505142315642__clean__COMPLETED], [20230505150117271__clean__COMPLETED], [20230505152529387__deltacommit__COMPLETED], [20230505153528490__deltacommit__COMPLETED], [20230505153612056__clean__COMPLETED], [20230505154028493__deltacommit__COMPLETED], [20230505154110705__clean__COMPLETED], [20230505160621946__deltacommit__COMPLETED], [20230505160656523__clean__COMPLETED], [20230505162028078__deltacommit__COMPLETED], [20230508052013526__deltacommit__COMPLETED], [20230508061027670__deltacommit__COMPLETED], [20230508061528064__deltacommit__COMPLETED], [20230508061607863__clean__COMPLETED], [20230508062527946__deltacommit__COMPLETED], [20230508063027621__deltacommit__COMPLETED], [20230508063527411__deltacommit__COMPLETED], [20230508075122470__deltacommit__COMPLETED], [20230508075206444__clean__COMPLETED], [20230508080027707__deltacommit__COMPLETED], [20230508080644245__deltacommit__COMPLETED], [20230508082114443__deltacommit__COMPLETED], [20230508082527941__deltacommit__COMPLETED], [20230508083028439__deltacommit__COMPLETED], [20230508083526967__deltacommit__COMPLETED], [20230508084027830__deltacommit__COMPLETED], [20230508084628273__deltacommit__COMPLETED], [==>20230508084702100__logcompaction__INFLIGHT], [20230508085556469__deltacommit__COMPLETED]]

and later we got this error:

An error occurred while calling o247.save.
: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback s3a://***/assets/asset commits 20230508084702100

Refer to the attached snippet (Hudi_error_after_change.txt) for the detailed log.

We reverted to the old configuration, and now we are encountering this error:

An error occurred while calling o166.save.
: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback s3a://***/asset commits 20230508084702100
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:823)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:727)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:711)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:706)
    at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:836)
    at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:156)
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:835)
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:828)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:362)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hudi.exception.HoodieRollbackException: Found commits after time :20230508084702100, please rollback greater commits first
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.validateRollbackCommitSequence(BaseRollbackActionExecutor.java:172)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.doRollbackAndGetStats(BaseRollbackActionExecutor.java:211)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:108)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:135)
    at org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:194)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:807)
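The "Caused by" line is the key detail: Hudi is trying to roll back the inflight log compaction at 20230508084702100, but the timeline already contains a completed delta commit after it (20230508085556469), and the rollback validator refuses to roll back an instant that is not the latest one. One way to see which timeline files back the stuck instant is to list the .hoodie directory directly. A hedged sketch using boto3 follows; the bucket and prefix are placeholders, and this is for inspection only, not deletion:

import boto3

# Placeholders: replace with the real bucket and table base prefix.
BUCKET = "my-bucket"
HOODIE_PREFIX = "assets/asset/.hoodie/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=HOODIE_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Instant files are named <instant>.<action>[.inflight|.requested];
        # the stuck instant here is the inflight log compaction.
        if "20230508084702100" in key:
            print(key)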

Environment Description

Hudi version: 0.13.0
Storage: S3 (s3a)
Table type: MERGE_ON_READ

ad1happy2go commented 1 year ago

@PhantomHunt Can you share the timeline with us?

PhantomHunt commented 1 year ago

> @PhantomHunt Can you share the timeline with us?

I have shared the timeline in the error description and log above.

The specific log compaction instant is this one: [==>20230508084702100__logcompaction__INFLIGHT]

Other than this, do you need any other timeline?

ad1happy2go commented 1 year ago

@PhantomHunt This is a common error that Hudi users face when two concurrent writers or table services run at the same time and the table gets into a bad state.

Similar issue here - https://github.com/apache/hudi/issues/9213
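For multi-writer or overlapping-table-service setups, the Hudi docs describe optimistic concurrency control with an external lock provider. A sketch of those settings follows, using the DynamoDB lock provider as one example; the table and region values are placeholders:

concurrency_config = {
    # Documented Hudi multi-writer settings (0.13.x):
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    # With OCC, failed writes should be cleaned up lazily, not eagerly.
    'hoodie.cleaner.policy.failed.writes': 'LAZY',
    # Any supported lock provider works; DynamoDB is just one option.
    'hoodie.write.lock.provider': 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
    'hoodie.write.lock.dynamodb.table': 'hudi-locks',   # placeholder
    'hoodie.write.lock.dynamodb.region': 'us-east-1',   # placeholder
}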

PhantomHunt commented 1 year ago

Okay, but we have only one writer, and it's a non-streaming job.