apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0
[SUPPORT] The rollback failed because the file could not be created because the marker file already existed. #11767

Open LmrZER0 opened 1 month ago

LmrZER0 commented 1 month ago

To Reproduce

Steps to reproduce the behavior:

  1. My config:
     hoodie.write.concurrency.mode=optimistic_concurrency_control
     hoodie.cleaner.policy.failed.writes=LAZY
     hoodie.write.concurrency.early.conflict.detection.enable=TRUE
  2. The job was not restarted.

Expected behavior

2024-08-13 11:06:01.598 ERROR [pool-258-thread-1:8-thread-1] org.apache.hudi.async.HoodieAsyncService - Service shutdown with error
java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback hdfs://ns1200/user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d commits 20240811184332421
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
    at org.apache.hudi.async.AsyncCleanerService.waitForCompletion(AsyncCleanerService.java:75)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.asyncClean(BaseHoodieTableServiceClient.java:132)
    at org.apache.hudi.client.HoodieFlinkWriteClient.waitForCleaningFinish(HoodieFlinkWriteClient.java:344)
    at org.apache.hudi.sink.CleanFunction.lambda$notifyCheckpointComplete$1(CleanFunction.java:84)
    at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback hdfs://ns1200/user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d commits 20240811184332421
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1061)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1008)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:935)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:917)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:912)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$clean$1cda88ee$1(BaseHoodieTableServiceClient.java:739)
    at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:214)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:738)
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:843)
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:816)
    at org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    ... 3 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: Error occurs when executing flatMap
    at org.apache.hudi.common.function.FunctionWrapper.lambda$throwingFlatMapWrapper$1(FunctionWrapper.java:50)
    at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
    at java.util.stream.AbstractTask.compute(AbstractTask.java:316)
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
    at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
    at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.hudi.client.common.HoodieFlinkEngineContext.flatMap(HoodieFlinkEngineContext.java:141)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper.maybeDeleteAndCollectStats(BaseRollbackHelper.java:150)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper.performRollback(BaseRollbackHelper.java:115)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.executeRollback(BaseRollbackActionExecutor.java:245)
    at org.apache.hudi.table.action.rollback.MergeOnReadRollbackActionExecutor.executeRollback(MergeOnReadRollbackActionExecutor.java:87)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.doRollbackAndGetStats(BaseRollbackActionExecutor.java:227)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:111)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
    at org.apache.hudi.table.HoodieFlinkMergeOnReadTable.rollback(HoodieFlinkMergeOnReadTable.java:158)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1044)
    ... 14 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: Failed to create marker file hdfs://ns1007/user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d/.hoodie/.temp/20240811185848523/dt=2024-08-11/.00000168-778b-477d-b4ab-1417e067f08e_20240811182559380.log.1_13-64-0.marker.APPEND
    at org.apache.hudi.table.marker.DirectWriteMarkers.create(DirectWriteMarkers.java:264)
    at org.apache.hudi.table.marker.DirectWriteMarkers.createWithEarlyConflictDetection(DirectWriteMarkers.java:243)
    at org.apache.hudi.table.marker.WriteMarkers.createIfNotExists(WriteMarkers.java:135)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper$1.createAppendMarker(BaseRollbackHelper.java:251)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper$1.preLogFileOpen(BaseRollbackHelper.java:241)
    at org.apache.hudi.common.table.log.HoodieLogFormatWriter.getOutputStream(HoodieLogFormatWriter.java:100)
    at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:149)
    at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:140)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper.lambda$maybeDeleteAndCollectStats$b2977713$1(BaseRollbackHelper.java:181)
    at org.apache.hudi.common.function.FunctionWrapper.lambda$throwingFlatMapWrapper$1(FunctionWrapper.java:48)
    ... 38 common frames omitted
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d/.hoodie/.temp/20240811185848523/dt=2024-08-11/.00000168-778b-477d-b4ab-1417e067f08e_20240811182559380.log.1_13-64-0.marker.APPEND for client 10.198.21.35 already exists
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:463)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2874)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem.access$401(JDFSNamesystem.java:177)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem$5.call(JDFSNamesystem.java:1494)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem$5.call(JDFSNamesystem.java:1484)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem$CoalesceWriteThread.run(JDFSNamesystem.java:1647)

danny0405 commented 1 month ago

Do you have multiple jobs writing to this table? With lazy cleaning, only one cleaner is allowed for now, because cleaning is not currently guarded by any lock. That means you can enable cleaning in only a single job.
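One way to picture the "single cleaning job" setup is the sketch below. It is only an illustration under stated assumptions: the helper class `SingleCleanerOptions` and the job names are hypothetical, and it assumes the Flink writer's `clean.async.enabled` option is the switch that turns the cleaning service on or off for a job. The two `hoodie.*` keys are copied from the config in the report.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Hudi. */
final class SingleCleanerOptions {
    private SingleCleanerOptions() {}

    /**
     * Builds one option map per job so that exactly one job runs the cleaner.
     * Assumption: "clean.async.enabled" controls cleaning for a Flink writer job.
     */
    static Map<String, Map<String, String>> build(List<String> jobNames, String cleanerJob) {
        if (!jobNames.contains(cleanerJob)) {
            throw new IllegalArgumentException(
                "Cleaner job " + cleanerJob + " is not one of " + jobNames);
        }
        Map<String, Map<String, String>> perJob = new LinkedHashMap<>();
        for (String job : jobNames) {
            Map<String, String> opts = new LinkedHashMap<>();
            // Keys below are taken from the reporter's configuration.
            opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
            opts.put("hoodie.cleaner.policy.failed.writes", "LAZY");
            // Only the designated job keeps cleaning enabled.
            opts.put("clean.async.enabled", String.valueOf(job.equals(cleanerJob)));
            perJob.put(job, opts);
        }
        return perJob;
    }
}
```

Each resulting map would then be passed to the corresponding job as its table options. Whether disabling cleaning this way also stops the lazy rollback of failed writes in the other jobs should be checked against the Hudi version in use.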

ad1happy2go commented 1 month ago

@LmrZER0 Also, can you provide your full writer configurations?

ad1happy2go commented 1 month ago

@LmrZER0 Will you be able to provide us with the required info so we can look into this further? Please let us know if it has been resolved.

nsivabalan commented 4 weeks ago

Do you have Spark speculation enabled, by any chance?

danny0405 commented 4 weeks ago

Even if the marker already exists, we could still proceed with the rollback; this might be a possible improvement.
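The idea of tolerating an existing marker can be sketched as follows. This is a minimal illustration, not Hudi's `DirectWriteMarkers` code: the class `IdempotentMarkers` and its method are hypothetical, and it uses `java.nio.file` on a local file system as a stand-in for the HDFS client seen in the stack trace.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; not Hudi's marker implementation. */
final class IdempotentMarkers {
    private IdempotentMarkers() {}

    /**
     * Creates an empty marker file unless it already exists.
     *
     * @return true if this call created the marker, false if it was already there
     */
    static boolean createIfAbsent(Path marker) {
        try {
            if (marker.getParent() != null) {
                Files.createDirectories(marker.getParent());
            }
            // createFile is atomic and throws if the file is already present.
            Files.createFile(marker);
            return true;
        } catch (FileAlreadyExistsException e) {
            // An earlier rollback attempt may have left the marker behind;
            // treat that as success so the rollback can continue.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to create marker file " + marker, e);
        }
    }
}
```

With this shape, a retried rollback that finds its own marker from a previous attempt would keep going rather than fail. Whether that is safe when early conflict detection is enabled is a separate question that the sketch does not address.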