Open MrAladdin opened 2 weeks ago
The writing program goes through multiple deltacommit processes and finally, by observing the Spark UI, it is noted that the data flow is stalled and no further calculations are being performed.
These errors occurred when switching from version 0.14.1 to 0.15.0, and the Hudi write configurations have not changed.
Thanks for the feedback, the error stacktrace is not very detailed, it would be very helpful if you can past the "caused by" part, @nsivabalan , not sure if it is related with the rollback fixes for mor table.
Thanks for the feedback, the error stacktrace is not very detailed, it would be very helpful if you can past the "caused by" part, @nsivabalan , not sure if it is related with the rollback fixes for mor table.
The text above contains "Caused by".
Caused by: java.lang.IllegalStateException at org.apache.hudi.common.util.ValidationUtils.checkState(ValidationUtils.java:62) at org.apache.hudi.metadata.HoodieMetadataPayload.lambda$createPartitionFilesRecord$2(HoodieMetadataPayload.java:335) at java.util.HashMap.forEach(HashMap.java:1289) at org.apache.hudi.metadata.HoodieMetadataPayload.createPartitionFilesRecord(HoodieMetadataPayload.java:332) at org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$convertMetadataToFilesPartitionRecords$6(HoodieTableMetadataUtil.java:425) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1699) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) at org.apache.hudi.metadata.HoodieTableMetadataUtil.convertMetadataToFilesPartitionRecords(HoodieTableMetadataUtil.java:428) at org.apache.hudi.metadata.HoodieTableMetadataUtil.convertMetadataToRecords(HoodieTableMetadataUtil.java:356) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.lambda$updateFromWriteStatuses$22(HoodieBackedTableMetadataWriter.java:920) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:861) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.updateFromWriteStatuses(HoodieBackedTableMetadataWriter.java:918) at org.apache.hudi.client.BaseHoodieClient.writeTableMetadata(BaseHoodieClient.java:257)
Thanks for the feedback, the error stacktrace is not very detailed, it would be very helpful if you can past the "caused by" part, @nsivabalan , not sure if it is related with the rollback fixes for mor table.
timeline show active --limit 1000 --with-metadata-table
, there is no record of any INFLIGHT or similar behavior under the metadata-table corresponding to the deltacommit that has been in INFLIGHT status, and it shows all as '-'.Thanks for the feedback, the error stacktrace is not very detailed, it would be very helpful if you can past the "caused by" part, @nsivabalan , not sure if it is related with the rollback fixes for mor table.
When switching from 0.14.1 to 0.15, a significant number of rollbacks were observed in the timeline, and another table experienced continuous rollbacks yesterday, preventing data from being written.
It does have a file size check in branch 0.15.0: , but the check is removed from master in commit: https://github.com/apache/hudi/pull/9765, and not sure it is related with the rollback fix: https://github.com/apache/hudi/pull/11187, cc @codope @nsivabalan , can you check whether it is related with the fix?
Looks like it is related with another issue: https://github.com/apache/hudi/issues/11594, they should belongs to the same culprits.
@MrAladdin Is it possible for you to zip and share the complete .hoodie
directory? We need to inspect the commit metadata and see how come 0-byte files being added to the files metadata.
In the meantime, you can disable metadata table to unblock yourself
hoodie.metadata.enable=false
In the meantime, you can disable metadata table to unblock yourself
hoodie.metadata.enable=false
It uses the record_index index, so it cannot be closed.
@MrAladdin Is it possible for you to zip and share the complete
.hoodie
directory? We need to inspect the commit metadata and see how come 0-byte files being added to the files metadata.
After compression, the .hoodie file is 550MB in size, which might be inconvenient to provide. The issue at https://github.com/apache/hudi/issues/11594 appears to be part of a series of related problems. You can find my specific production configuration there, which is the same as the 0.14.1 version, and 0.14.1 works correctly.
hey @MrAladdin : are you in hudi slack. we can connect and investigate faster. can you post a msg there and tag me (shivnarayan) and sagar (sagar sumit)
also, do you think you can use global simple in the mean time while we try to find the root cause and get a fix out?
also, do you think you can use global simple in the mean time while we try to find the root cause and get a fix out?
Due to my business scenario involving a large number of upsert operations (public opinion data), other index types did not perform well in previous tests. Only dynamic bucket and the newly released record_index in version 0.14 met the requirements. I have always wanted to find an index that doesn't require manual intervention or adjustment, so record_index has attracted my attention and interest. This test is mainly focused on the performance improvements of record_index in version 0.15. I can wait for this to be fixed before conducting tests. Actually, there is another issue mentioned in https://github.com/apache/hudi/issues/11567. @nsivabalan When the amount of already stored data is huge, it is also a maddening issue. You can pay attention to this as well.
hey @MrAladdin : are you in hudi slack. we can connect and investigate faster. can you post a msg there and tag me (shivnarayan) and sagar (sagar sumit)
I'm really sorry, but due to certain reasons, I am unable to help you.
@MrAladdin Or maybe a quick google meeting sync, that would be helpful.
@MrAladdin Is it possible for you to zip and share the complete
.hoodie
directory? We need to inspect the commit metadata and see how come 0-byte files being added to the files metadata.
@codope Hello, it seems like I've discovered something new, possibly related to the hoodie.allow.empty.commit parameter. There appears to be a conflict with the metadata check mechanism in version 0.15.0. I have disabled it in the code, and so far, it's been running for 4 hours without any issues. I'll monitor the situation for a while longer before providing further feedback.
@MrAladdin Or maybe a quick google meeting sync, that would be helpful.
@danny0405 Thank you for your positive attitude, thumbs up to you, hahaha. I just provided some feedback on the test results above. I am also looking into the source code and continuously trying to resolve the issue. I will do my best to identify the problem.
@nsivabalan @ad1happy2go @danny0405 @codope After reviewing the source code and observing the long-running task following the closure of the hoodie.allow.empty.commit parameter, it has been preliminarily confirmed that the anomaly is indeed caused by the empty commit.
I'm sorry, I may have misjudged. After restarting the program, a large number of continuous "update metadata fail" exceptions appeared again. I noticed that it may be caused by a rollback.
There might still be some underlying issues in certain areas, and the following exception may also occur: java.io.FileNotFoundException: File xxx/.hoodie/.temp/20240712181232182 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1104) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:147) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1175) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1172) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1182) at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:274) at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listStatus(ChRootedFileSystem.java:267) at org.apache.hadoop.fs.viewfs.ViewFileSystem.listStatus(ViewFileSystem.java:605) at org.apache.hudi.hadoop.fs.HoodieRetryWrapperFileSystem.lambda$listStatus$940984dd$1(HoodieRetryWrapperFileSystem.java:198) at org.apache.hudi.common.util.RetryHelper.start(RetryHelper.java:85) at org.apache.hudi.common.util.RetryHelper.start(RetryHelper.java:113) at org.apache.hudi.hadoop.fs.HoodieRetryWrapperFileSystem.listStatus(HoodieRetryWrapperFileSystem.java:198) at org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem.lambda$listStatus$15(HoodieWrapperFileSystem.java:377) at org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:118) at org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:376) at org.apache.hudi.storage.hadoop.HoodieHadoopStorage.listDirectEntries(HoodieHadoopStorage.java:179) at org.apache.hudi.table.marker.DirectWriteMarkers.getAppendedLogPaths(DirectWriteMarkers.java:147) at org.apache.hudi.table.action.rollback.BaseRollbackHelper.performRollback(BaseRollbackHelper.java:108) at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.executeRollback(BaseRollbackActionExecutor.java:245) at org.apache.hudi.table.action.rollback.MergeOnReadRollbackActionExecutor.executeRollback(MergeOnReadRollbackActionExecutor.java:87) at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.doRollbackAndGetStats(BaseRollbackActionExecutor.java:227) at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:111) at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141) at org.apache.hudi.table.HoodieSparkMergeOnReadTable.rollback(HoodieSparkMergeOnReadTable.java:218) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1044) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1008) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:935) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:917) at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:912) at org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$clean$1cda88ee$1(BaseHoodieTableServiceClient.java:739) at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:214) at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:738) at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:843) at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:816) at org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
Here's a question for everyone: Can anyone tell me why there are INFLIGHT deltacommit but no rollback? These deltacommit should be failures, right?
@nsivabalan @ad1happy2go @danny0405 @codope There are a large number of metadata-related jobs missing from the Spark UI due to an exceptional deltacommit that has been consistently in the INFLIGHT state. Furthermore, using the command commit showfiles --commit exception_deltacommit_id
does not reveal any file write information. Additionally, the .hoodie/.temp/
directory contains folders and (all 0 size )data corresponding to these exception_deltacommit_ids.
exception :
normal :
Here's a question for everyone: Can anyone tell me why there are INFLIGHT deltacommit but no rollback? These deltacommit should be failures, right?
I guess the rollback strategy might be condifured as lazy, these are data table delta commits right?
@nsivabalan @ad1happy2go @danny0405 @codope
The reason has been found: spark.speculation conflicts with 0.15. Everything works fine when spark.speculation is disabled.
Environment Description
Hudi version : 0.15.0
Spark version : 3.5.1
Hive version : no
Hadoop version : 3.1.3
Storage (HDFS/S3/GCS..) : HDFS
Running on Docker? (yes/no) : no
Stacktrace
Add the stacktrace of the error.
24/07/08 15:02:16 ERROR HoodieStreamingSink: Micro batch id=5377 threw following exception: org.apache.hudi.exception.HoodieException: Failed to update metadata at org.apache.hudi.client.BaseHoodieClient.writeTableMetadata(BaseHoodieClient.java:262) at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:294) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:239) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:108) at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1082) at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:508) at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$3(HoodieStreamingSink.scala:141) at scala.util.Try$.apply(Try.scala:210) at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:133) at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:237) at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:132) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:732) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:729) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:729) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:286) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239) at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211) Caused by: java.lang.IllegalStateException at org.apache.hudi.common.util.ValidationUtils.checkState(ValidationUtils.java:62) at org.apache.hudi.metadata.HoodieMetadataPayload.lambda$createPartitionFilesRecord$2(HoodieMetadataPayload.java:335) at java.util.HashMap.forEach(HashMap.java:1289) at org.apache.hudi.metadata.HoodieMetadataPayload.createPartitionFilesRecord(HoodieMetadataPayload.java:332) at org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$convertMetadataToFilesPartitionRecords$6(HoodieTableMetadataUtil.java:425) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1699) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) at org.apache.hudi.metadata.HoodieTableMetadataUtil.convertMetadataToFilesPartitionRecords(HoodieTableMetadataUtil.java:428) at org.apache.hudi.metadata.HoodieTableMetadataUtil.convertMetadataToRecords(HoodieTableMetadataUtil.java:356) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.lambda$updateFromWriteStatuses$22(HoodieBackedTableMetadataWriter.java:920) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:861) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.updateFromWriteStatuses(HoodieBackedTableMetadataWriter.java:918) at org.apache.hudi.client.BaseHoodieClient.writeTableMetadata(BaseHoodieClient.java:257)