
[SUPPORT] org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3a://<path>/.hoodie/20240908172432285.replacecommit.requested #11917

Open kishoreraj05 opened 2 months ago

kishoreraj05 commented 2 months ago

1st Exception in the Job:

```
24/09/09 09:09:24 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20240909085125898__deltacommit__COMPLETED__20240909085129000]}
24/09/09 09:09:24 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
24/09/09 09:09:24 INFO ClusteringUtils: Found 0 files in pending clustering operations
24/09/09 09:09:24 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
24/09/09 09:09:24 INFO FileSystemViewManager: Creating in-memory based Table View
24/09/09 09:09:24 WARN BaseHoodieWriteClient: Processing rollback plan failed for [==>20240908172440784__rollback__INFLIGHT__20240908172442000], skip the plan
org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3a://path-xxxx/.hoodie/20240908172432285.replacecommit.requested
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:803)
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:310)
    at org.apache.hudi.common.util.ClusteringUtils.getRequestedReplaceMetadata(ClusteringUtils.java:100)
    at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:116)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.getPendingRollbackInfos(BaseHoodieTableServiceClient.java:859)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.getPendingRollbackInfos(BaseHoodieTableServiceClient.java:820)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:914)
    at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:940)
    at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:222)
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:939)
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:932)
    at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:429)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
```

The job kept running with the 1st exception for ~1 hour.

2nd Exception:

```
Caused by: java.io.InterruptedIOException: open s3a://path-xxxx/.hoodie/metadata/files/.files-0000-0_20240908172145847001.log.66_0-1993-36542 at 0 on s3a://path/.hoodie/metadata/files/.files-0000-0_20240908172145847001.log.66_0-1993-36542: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException(S3AUtils.java:389)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:196)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:117)
    at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:226)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:392)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:228)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:115)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:352)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:412)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:348)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:226)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:270)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:384)
    at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:503)
    at java.base/java.io.DataInputStream.read(Unknown Source)
    at java.base/java.io.DataInputStream.readFully(Unknown Source)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.hasNextMagic(HoodieLogFileReader.java:404)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.readMagic(HoodieLogFileReader.java:391)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.hasNext(HoodieLogFileReader.java:376)
    ... 45 more
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5437)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5384)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1519)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$reopen$0(S3AInputStream.java:227)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:115)
    ... 61 more
```
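The `Timeout waiting for connection from pool` message generally points at exhaustion of the S3A client's HTTP connection pool. A minimal sketch of the Spark-side settings that size this pool, assuming pool exhaustion is indeed the bottleneck here; the values are illustrative, not what our job currently uses:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: enlarging the S3A connection pool and its thread ceiling.
// These are standard Hadoop S3A options, passed via the spark.hadoop.* prefix.
val spark = SparkSession.builder()
  .appName("hudi-writer")
  // maximum simultaneous HTTP connections the S3A client may hold open
  .config("spark.hadoop.fs.s3a.connection.maximum", "200")
  // upper bound on S3A transfer threads competing for those connections
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  .getOrCreate()
```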

3rd Exception, after which the job fails:

```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 23.0 failed 4 times, most recent failure: Lost task 9.3 in stage 23.0 (TID 468) (172.30.147.40 executor 8): org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve files in partition s3a://path/dt=2024-09-09/orgid=xxxxx from metadata
    at org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:147)
    at org.apache.hudi.metadata.HoodieMetadataFileSystemView.listPartition(HoodieMetadataFileSystemView.java:67)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$16(AbstractTableFileSystemView.java:433)
    at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(Unknown Source)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:424)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.getLatestBaseFilesBeforeOrOn(AbstractTableFileSystemView.java:661)
    at org.apache.hudi.table.action.commit.UpsertPartitioner.getSmallFiles(UpsertPartitioner.java:301)
    at org.apache.hudi.table.action.commit.UpsertPartitioner.lambda$getSmallFilesForPartitions$f1d92f9e$1(UpsertPartitioner.java:282)
    at org.apache.spark.api.java.JavaPairRDD$.$anonfun$pairFunToScalaFun$1(JavaPairRDD.scala:1073)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at scala.collection.AbstractIterator.to(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:414)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220)
    at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:201)
    at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:117)
    at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:76)
    at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:466)
    at org.apache.hudi.metadata.HoodieMetadataLogRecordReader$Builder.build(HoodieMetadataLogRecordReader.java:219)
    at org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:498)
    at org.apache.hudi.metadata.HoodieBackedTableMetadata.openReaders(HoodieBackedTableMetadata.java:429)
    at org.apache.hudi.metadata.HoodieBackedTableMetadata.getOrCreateReaders(HoodieBackedTableMetadata.java:414)
    at org.apache.hudi.metadata.HoodieBackedTableMetadata.lookupKeysFromFileSlice(HoodieBackedTableMetadata.java:291)
    at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:255)
    at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:145)
    at org.apache.hudi.metadata.BaseTableMetadata.fetchAllFilesInPartition(BaseTableMetadata.java:348)
    at org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:145)
    ... 35 more
Caused by: java.lang.IllegalStateException: Block has already been inflated
    at org.apache.hudi.common.util.ValidationUtils.checkState(ValidationUtils.java:76)
    at org.apache.hudi.common.table.log.block.HoodieLogBlock.inflate(HoodieLogBlock.java:292)
    at org.apache.hudi.common.table.log.block.HoodieLogBlock.inflate(HoodieLogBlock.java:303)
    at org.apache.hudi.common.table.log.block.HoodieDataBlock.readRecordsFromBlockPayload(HoodieDataBlock.java:170)
    at org.apache.hudi.common.table.log.block.HoodieDataBlock.getRecordIterator(HoodieDataBlock.java:132)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.getRecordsIterator(AbstractHoodieLogRecordReader.java:958)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:780)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:825)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:403)
    ... 49 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
    at scala.Option.foreach(Option.scala:407)
```

kishoreraj05 commented 2 months ago

I looked for the file s3a://path-xxxx/.hoodie/20240908172432285.replacecommit.requested under the .hoodie folder but did not find it. Nor did I find any file whose name starts with 20240908172432285.

kishoreraj05 commented 2 months ago

Attached the names of the files present under the .hoodie folder: files_under_dot_hoodie_folder.txt

kishoreraj05 commented 2 months ago

I have already confirmed that no two parallel jobs are writing to the same path.

ad1happy2go commented 2 months ago

@kishoreraj05 Please share the writer configurations and, if possible, attach the zipped Hudi timeline files.

kishoreraj05 commented 2 months ago

@ad1happy2go Writer configurations:

```
"hoodie.datasource.hive_sync.database" = ${outputDbName},
"hoodie.datasource.hive_sync.table" = ${outputTableName},
"hoodie.table.name" = ${outputTableName},
"hoodie.index.type" = "BLOOM",
"hoodie.write.markers.type" = "direct",
"hoodie.embed.timeline.server" = "false",
"hoodie.datasource.write.recordkey.field" = "ts,xxxx,xxxx,xxxx,xxxx",
"hoodie.datasource.write.precombine.field" = "ts",
"hoodie.datasource.write.partitionpath.field" = "dt,xxxx",
"hoodie.datasource.hive_sync.partition_fields" = "dt,xxxx",
"hoodie.datasource.write.operation" = "upsert",
"hoodie.cleaner.policy" = "KEEP_LATEST_COMMITS",
"hoodie.datasource.write.storage.type" = "COPY_ON_WRITE",
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE",
"hoodie.cleaner.commits.retained" = "5",
"hoodie.cleaner.parallelism" = ${others.hudiCleanerParallelism},
"hoodie.datasource.write.hive_style_partitioning" = "true",
"hoodie.datasource.hive_sync.enable" = "true",
"hoodie.datasource.hive_sync.partition_extractor_class" = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" = "false",
"hoodie.datasource.hive_sync.mode" = "hms",
"hoodie.datasource.hive_sync.metastore.uris" = ${hiveUrl},
"hoodie.clean.invalid.duplicates" = "true",
"hoodie.metadata.enable" = ${others.hudiMetadataEnable}
```
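For context, these options reach Hudi through the Spark datasource writer roughly as sketched below; the table name, base path, and the masked field lists are placeholders, not the actual job values:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Rough sketch of the upsert write path, using a subset of the options listed above.
def upsert(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "output_table")                     // ${outputTableName}
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.datasource.write.recordkey.field", "ts,...")     // masked in the issue
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "dt,...") // masked in the issue
    .option("hoodie.index.type", "BLOOM")
    .option("hoodie.metadata.enable", "true")                        // ${others.hudiMetadataEnable}
    .mode(SaveMode.Append)
    .save(basePath)
}
```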

We are also running delete_partition jobs. We make sure that the streaming job is not running while a delete_partition job is running.

Delete partition write configurations:

```
"hoodie.datasource.write.operation" = "delete_partition",
"hoodie.table.name" = ${outputTableName},
"hoodie.datasource.hive_sync.database" = ${outputDbName},
"hoodie.datasource.hive_sync.table" = ${outputTableName},
"hoodie.datasource.hive_sync.enable" = "True",
"hoodie.datasource.hive_sync.use_jdbc" = "false",
"hoodie.datasource.hive_sync.mode" = "hms",
"hoodie.datasource.hive_sync.metastore.uris" = ${hiveUrl},
"hoodie.datasource.write.hive_style_partitioning" = "true",
"hoodie.datasource.hive_sync.partition_extractor_class" = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.write.storage.type" = "COPY_ON_WRITE",
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE",
"hoodie.cleaner.policy" = "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained" = "10",
"hoodie.datasource.write.partitionpath.field" = "dt,xxxx",
"hoodie.datasource.write.partitions.to.delete" = "dt=$dt/xxxx=*",
"hoodie.datasource.hive_sync.partition_fields" = "dt,xxxx",
"hoodie.metadata.enable" = ${others.hudiMetadataEnable}
```
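The delete_partition run is a separate Spark write along these lines; whether the input DataFrame can be empty (it typically only needs to carry the table schema) is an assumption here, and the table name and partition pattern are again placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Rough sketch of how the delete_partition operation is invoked through the datasource writer.
def deletePartitions(df: DataFrame, basePath: String, dt: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "output_table")                              // ${outputTableName}
    .option("hoodie.datasource.write.operation", "delete_partition")
    .option("hoodie.datasource.write.partitions.to.delete", s"dt=$dt/xxxx=*") // masked pattern
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .mode(SaveMode.Append)
    .save(basePath)
}
```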

kishoreraj05 commented 2 months ago

@ad1happy2go we are not able to share the Hudi timeline files.

ad1happy2go commented 2 months ago

@kishoreraj05 Which Hudi and Spark versions are you using?

kishoreraj05 commented 2 months ago

@ad1happy2go we are using hudi-spark3.3-bundle_2.12

ad1happy2go commented 2 months ago

@kishoreraj05 But which Hudi version? I don't see it specified anywhere.

kishoreraj05 commented 2 months ago

@ad1happy2go my bad, I missed mentioning the Hudi version.

Hudi Version - 0.14.0

yihua commented 2 months ago

The `Block has already been inflated` issue is due to a bug that is fixed by #7434.

kishoreraj05 commented 2 months ago

@yihua, could you let me know in which version this bug fix will be released?

ad1happy2go commented 2 months ago

@kishoreraj05 We will add it to the next minor release, which would be 0.15.1.