apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] hudi 0.12.2 sometimes throws org.apache.hudi.exception.HoodieIOException: IOException when reading log file #8912

Open zyclove opened 1 year ago

zyclove commented 1 year ago

Describe the problem you faced

A Spark SQL task scheduled against a Hudi table sometimes fails with org.apache.hudi.exception.HoodieIOException: IOException when reading log file.

To Reproduce

Steps to reproduce the behavior:

1. A Spark SQL Hudi task executes once every half hour (see the sketch after this list).
2. The task sometimes fails because a log file is missing, as in the stacktrace below.
3. Simply touching the missing file and rerunning succeeds, but that is not an acceptable workaround.
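
For context, a minimal sketch of what such a half-hourly job could look like. The INSERT statement and the source table stg_smart_gateway_batch are placeholders (the issue does not show the actual SQL); the two --conf settings are the standard Hudi Spark SQL session configs:

```
# hypothetical wrapper for the half-hourly job; the actual SQL is not shown in the issue
spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  -e "INSERT INTO ods_smart_gateway_batch SELECT * FROM stg_smart_gateway_batch"
```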

Expected behavior

The scheduled task should read the table's log files and complete without manual intervention.

Environment Description

- Hudi version: 0.12.2
- Query engine: Spark SQL on EMR (YARN)
- Storage: S3 (accessed via EmrFileSystem / S3NativeFileSystem)


Stacktrace

```
23/06/09 02:39:16 INFO BlockManagerInfo: Added rdd_7_16 in memory on ip-172-16-13-109.us-west-2.compute.internal:42539 (size: 2.2 KiB, free: 8.4 GiB)
23/06/09 02:39:16 WARN TaskSetManager: Lost task 93.3 in stage 0.0 (TID 110) (ip-172-16-12-181.us-west-2.compute.internal executor 1): org.apache.hudi.exception.HoodieIOException: IOException when reading log file
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:374)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209)
    at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:113)
    at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:106)
    at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343)
    at org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305)
    at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:89)
    at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.sql.execution.SQLExecutionRDD.compute(SQLExecutionRDD.scala:55)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.sql.execution.SQLConfInjectingRDD.compute(SQLConfInjectingRDD.scala:58)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://big-data-us/hudi/bi/bi_ods/ods_smart_gateway_batch/dt=2016/.00000008-ad34-409c-9303-31ae4b4dbb41-0_20230604041023497.log.4_8-12-8134'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:521)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:932)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:924)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:195)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.getFSDataInputStream(HoodieLogFileReader.java:479)
    at org.apache.hudi.common.table.log.HoodieLogFileReader.<init>(HoodieLogFileReader.java:115)
    at org.apache.hudi.common.table.log.HoodieLogFormatReader.<init>(HoodieLogFormatReader.java:70)
    at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:244)
    ... 50 more
```

```
23/06/09 02:39:16 ERROR TaskSetManager: Task 93 in stage 0.0 failed 4 times; aborting job
23/06/09 02:39:16 INFO YarnScheduler: Cancelling stage 0
23/06/09 02:39:16 INFO YarnScheduler: Killing all running tasks in stage 0: Stage cancelled
23/06/09 02:39:16 INFO YarnScheduler: Stage 0 was cancelled
23/06/09 02:39:16 INFO DAGScheduler: ShuffleMapStage 0 (mapToPair at HoodieJavaRDD.java:127) failed in 43.043 s due to Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most recent failure: Lost task 93.3 in stage 0.0 (TID 110) (ip-172-16-12-181.us-west-2.compute.internal executor 1): org.apache.hudi.exception.HoodieIOException: IOException when reading log file
    [stack trace and FileNotFoundException root cause identical to the executor trace above]
```
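
Since the root cause is a FileNotFoundException on one specific log file, a first triage step could be to confirm the file is actually absent from S3 (rather than a transient listing problem) and to see which log files of that file group survive. A possible check with the Hadoop CLI, using the path from the stacktrace:

```
# probe the exact path from the root cause (log files are dot-prefixed, so use the full name)
hadoop fs -ls 's3://big-data-us/hudi/bi/bi_ods/ods_smart_gateway_batch/dt=2016/.00000008-ad34-409c-9303-31ae4b4dbb41-0_20230604041023497.log.4_8-12-8134'

# list the surviving log files of the same file group in that partition
hadoop fs -ls 's3://big-data-us/hudi/bi/bi_ods/ods_smart_gateway_batch/dt=2016/' | grep '00000008-ad34-409c-9303-31ae4b4dbb41-0'
```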

ad1happy2go commented 1 year ago

@zyclove Can you post your table configuration, please?

zyclove commented 1 year ago

> @zyclove Can you post your table configuration, please?

hudi:ods_smart_gateway_batch->desc

| Property | Value |
| --- | --- |
| basePath | s3://tuya-big-data-us/hudi/bi/bi_ods/ods_smart_gateway_batch |
| metaPath | s3://tuya-big-data-us/hudi/bi/bi_ods/ods_smart_gateway_batch/.hoodie |
| fileSystem | s3 |
| hoodie.compaction.payload.class | org.apache.hudi.common.model.DefaultHoodieRecordPayload |
| hoodie.table.metadata.partitions | files |
| hoodie.table.type | MERGE_ON_READ |
| hoodie.table.precombine.field | gmt_modified |
| hoodie.table.partition.fields | dt |
| hoodie.table.cdc.enabled | false |
| hoodie.archivelog.folder | archived |
| hoodie.timeline.layout.version | 1 |
| hoodie.table.checksum | 3204991261 |
| hoodie.datasource.write.drop.partition.columns | false |
| hoodie.table.timeline.timezone | LOCAL |
| hoodie.table.recordkey.fields | id |
| hoodie.table.name | ods_smart_gateway_batch |
| hoodie.partition.metafile.use.base.format | false |
| hoodie.datasource.write.hive_style_partitioning | true |
| hoodie.populate.meta.fields | true |
| hoodie.table.keygenerator.class | org.apache.hudi.keygen.SimpleKeyGenerator |
| hoodie.table.base.file.format | PARQUET |
| hoodie.database.name | |
| hoodie.datasource.write.partitionpath.urlencode | false |
| hoodie.table.version | 5 |

zyclove commented 1 year ago

@ad1happy2go Excuse me, could you take a look at this problem? Could you help me figure out whether anything is wrong with these configurations?

ad1happy2go commented 1 year ago

@zyclove The configuration looks okay. Can you share any more information to help us triage? Are you running the cleaner? What operations were you running when you hit this error?
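
For reference, one way to check the cleaner angle: if a clean action removed log files that the half-hourly reader still referenced, it should show up in the table's clean history, and retaining more commits on the writer usually avoids that race. A sketch, with illustrative retention values rather than a confirmed fix for this issue:

```
# hudi-cli: inspect recent clean actions on the table
connect --path s3://tuya-big-data-us/hudi/bi/bi_ods/ods_smart_gateway_batch
cleans show

# writer-side retention knobs (illustrative values; keep
# hoodie.keep.min.commits greater than hoodie.cleaner.commits.retained)
--conf 'hoodie.cleaner.commits.retained=20'
--conf 'hoodie.keep.min.commits=25'
--conf 'hoodie.keep.max.commits=30'
```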