apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables #9861

Open arunvasudevan opened 1 year ago

arunvasudevan commented 1 year ago

We are running an inline compaction and cleaning process for MoR tables on EMR 6.12 (i.e., Hudi 0.13) and are facing a NoSuchElementException on a FileID. The specific file is present in S3, but it does not appear in any commit timeline. This is a low-frequency table that receives an insert record about once a month, and our cleaner and compactor are configured to run inline. However, the cleaner does not clean up the file, causing the writer to fail. Another file with a different fileID, containing the updated data, is present on the same path. To resolve the issue we deleted the orphaned stale file, but it is really not clear why this issue is occurring.

Here is the hoodie.properties file:

hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.table.type=MERGE_ON_READ
hoodie.table.metadata.partitions=
hoodie.table.precombine.field=source_ts_ms
hoodie.table.partition.fields=
hoodie.archivelog.folder=archived
hoodie.timeline.layout.version=1
hoodie.table.checksum=4106800621
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.timeline.timezone=LOCAL
hoodie.table.name=performer_ride_join_table
hoodie.table.recordkey.fields=performer_id,ride_id
hoodie.compaction.record.merger.strategy=eeb8d96f-b1e4-49fd-bbf8-28ac514178e5
hoodie.datasource.write.hive_style_partitioning=false
hoodie.partition.metafile.use.base.format=false
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.populate.meta.fields=true
hoodie.table.base.file.format=PARQUET
hoodie.database.name=
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.version=5
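For anyone hitting the same symptom, here is a minimal sketch (not from the original report) of how one might confirm whether a fileID is tracked by the active or archived timeline and by the file-system view, assuming Hudi 0.13 APIs and a hypothetical base path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.common.table.view.HoodieTableFileSystemView
import scala.collection.JavaConverters._

val basePath = "s3://bucket/path/performer_ride_join_table" // hypothetical path
val metaClient = HoodieTableMetaClient.builder()
  .setConf(new Configuration())
  .setBasePath(basePath)
  .build()

// Completed instants on the active timeline; per the report above, the
// orphaned file's fileID should appear in none of them.
metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants
  .getInstants.iterator().asScala.foreach(println)

// Archived instants can be listed the same way (the reporter later confirms
// the archive folder is empty for this table).
metaClient.getArchivedTimeline.getInstants.iterator().asScala.foreach(println)

// File groups known to the file-system view; a non-partitioned table uses
// the empty relative partition path.
val fsView = new HoodieTableFileSystemView(
  metaClient,
  metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants,
  metaClient.getFs.listStatus(new Path(basePath)))
fsView.getAllFileGroups("").iterator().asScala
  .foreach(fg => println(fg.getFileGroupId))
```

A file visible under the base path in S3 but absent from this output is exactly the "orphaned" state described above.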

To Reproduce

Steps to reproduce the behavior (this does not happen regularly; over the past 3 weeks it has happened twice, roughly once per week). A sketch of the setup is given after the steps:

  1. Create a Hudi writer with the above config.
  2. Set HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS.key() -> "1".
  3. Insert a record.
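Not part of the original report, but roughly, a minimal sketch of such a writer (hypothetical path and illustrative schema; option keys taken from the configs above, assuming Hudi 0.13 on Spark):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}

val spark = SparkSession.builder()
  .appName("hudi-inline-compaction-repro")
  .getOrCreate()
import spark.implicits._

// Step 3: a single insert record; the schema here is illustrative only.
val df = Seq((1L, 2L, System.currentTimeMillis()))
  .toDF("performer_id", "ride_id", "source_ts_ms")

df.write.format("hudi")
  .option(HoodieWriteConfig.TBL_NAME.key(), "performer_ride_join_table")
  .option(TABLE_TYPE.key(), "MERGE_ON_READ")
  .option(RECORDKEY_FIELD.key(), "performer_id,ride_id")
  .option(PRECOMBINE_FIELD.key(), "source_ts_ms")
  .option(KEYGENERATOR_CLASS_NAME.key(),
    "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
  .option("hoodie.compact.inline", "true")
  // Step 2: trigger inline compaction after every delta commit.
  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS.key(), "1")
  .option("hoodie.clean.automatic", "true")
  .mode("append")
  .save("s3://bucket/path/performer_ride_join_table") // hypothetical path
```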

Expected behavior

Cleaner should have cleaned up the orphaned file.

Environment Description

  * Hudi version: 0.13 (EMR 6.12)
  * Table type: MERGE_ON_READ
  * Storage: S3


Stacktrace

Error: 23/10/09 17:27:39 ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 174.0 failed 4 times, most recent failure: Lost task 0.3 in stage 174.0 (TID 30924) (ip-10-11-117-156.ec2.internal executor 28): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:336)
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:251)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:905)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:905)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:377)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:326)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.NoSuchElementException: FileID 62b95ad5-44df-496c-8309-4d5d270f2ef6-0 of partition path does not exist.
    at org.apache.hudi.io.HoodieMergeHandle.getLatestBaseFile(HoodieMergeHandle.java:156)
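For context on the final frame: the merge handle resolves the latest base file of the file group being updated from the table's file-system view, which is built from the commit timeline rather than from the S3 listing, so a file that exists on storage but is referenced by no commit is invisible to it. A rough Scala paraphrase of that check (the actual Hudi source is Java; this is an illustration, not the real implementation):

```scala
import java.util.NoSuchElementException
import org.apache.hudi.common.model.HoodieBaseFile
import org.apache.hudi.common.table.view.TableFileSystemView.BaseFileOnlyView

// Illustrative paraphrase: look up the fileID in the timeline-backed view
// and fail with the exception seen above if the view does not know it.
def latestBaseFileOrThrow(view: BaseFileOnlyView,
                          partitionPath: String,
                          fileId: String): HoodieBaseFile = {
  val baseFile = view.getLatestBaseFile(partitionPath, fileId) // org.apache.hudi.common.util.Option
  if (!baseFile.isPresent) {
    throw new NoSuchElementException(
      s"FileID $fileId of partition path $partitionPath does not exist.")
  }
  baseFile.get()
}
```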

ad1happy2go commented 1 year ago

@arunvasudevan Thanks for raising this. If I have understood correctly, your issue is that the cleaner occasionally fails to clean up the file. You mentioned that you don't see that file in any commit timeline. Did you check the archived commits too?

Can you also provide your writer configurations?

arunvasudevan commented 1 year ago

Yes, I checked the archive folder and it is empty in this case.

Here are the writer configurations:

hoodie.datasource.hive_sync.database:
hoodie.datasource.hive_sync.mode: HMS
hoodie.datasource.write.precombine.field: source_ts_ms
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.max.file.size: 67108864
hoodie.datasource.meta.sync.enable: true
hoodie.datasource.hive_sync.skip_ro_suffix: true
hoodie.metadata.enable: false
hoodie.datasource.hive_sync.table:
hoodie.index.type: SIMPLE
hoodie.clean.automatic: true
hoodie.datasource.write.operation: upsert
hoodie.metrics.reporter.type: CLOUDWATCH
hoodie.datasource.hive_sync.enable: true
hoodie.datasource.write.recordkey.field: version_id
hoodie.table.name: ride_version
hoodie.datasource.hive_sync.jdbcurl: jdbc:hive2://ip-:10000
hoodie.datasource.write.table.type: MERGE_ON_READ
hoodie.simple.index.parallelism: 240
hoodie.write.lock.dynamodb.partition_key:
hoodie.cleaner.policy: KEEP_LATEST_BY_HOURS
hoodie.compact.inline: true
hoodie.client.heartbeat.interval_in_ms: 600000
hoodie.datasource.compaction.async.enable: true
hoodie.metrics.on: true
hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.cleaner.policy.failed.writes: LAZY
hoodie.keep.max.commits: 1650
hoodie.cleaner.hours.retained: 168
hoodie.write.lock.dynamodb.table: peloton-prod-hudi-write-lock
hoodie.write.lock.provider: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.keep.min.commits: 1600
hoodie.datasource.write.partitionpath.field:
hoodie.compact.inline.max.delta.commits: 1
hoodie.write.concurrency.mode: optimistic_concurrency_control
hoodie.write.lock.dynamodb.region: us-east-1

arunvasudevan commented 1 year ago

@ad1happy2go Let me know if you need any more info.

ad1happy2go commented 1 year ago

@arunvasudevan Are you on the Hudi Slack? If yes, can you message me there so we can get on a call to understand the issue better? Thanks.

arunvasudevan commented 1 year ago

@ad1happy2go Messaged you on Hudi Slack. We can discuss this issue further there. Thanks!

rahil-c commented 1 year ago

cc @yihua @jonvex

nsivabalan commented 7 months ago

Hey @ad1happy2go: any follow-ups on this?