Open Toroidals opened 3 months ago
@Toroidals How are you ingesting your data the first time and how are you syncing the table to metastore?
I'm also was able to reproduce the issue. Looks like it's very similar to this one: https://github.com/apache/hudi/issues/6804
Tips before filing an issue
Have you gone through our FAQs? yes
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
hudi-0.15.0 spark sql query mro's ro table error: Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hadoop/apps/hive/warehouse/hudi.db/hudi_qwe_cdc/partition_001/00000096-b062-4667-9c66-37fb8e3f26ef_1-8-2_20240802095544277.parquet
Error when executing SQL using Spark SQL client: spark-sql> INSERT OVERWRITE TABLE doris_dws.doris_aaaa_upd select source_document_id, max(gl_date) gl_date FROM ods_rbs.ods_rbs_qwe_cdc_ro cfad where cfad.
_flink_cdc_op
!= 'd' group by source_document_id; org.apache.spark.SparkException: Job aborted due to stage failure: Task 709 in stage 19.0 failed 4 times, most recent failure: Lost task 709.3 in stage 19.0 (TID 4209) (crpprd6hd02 executor 29): java.io.FileNotFoundException: File does not exist: hdfs://cmccprdhadoop/apps/hive/warehouse/hudi.db/hudi_rbs_qwe_cdc/arch_acct_distributions/00000376-7e58-4ee3-b251-5e2958ab37f8_0-8-2_20240801162146243.parquetIt is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) Caused by: java.io.FileNotFoundException: File does not exist: hdfs://cmccprdhadoop/apps/hive/warehouse/hudi.db/hudi_rbs_qwe_cdc/arch_acct_distributions/00000376-7e58-4ee3-b251-5e2958ab37f8_0-8-2_20240801162146243.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
According to the prompt, first execute: "REFRESH TABLE ods_rbs.ods_rbs_qwe_cdc_ro" then execute the SQL: "INSERT OVERWRITE TABLE doris_dws.doris_aaaa_upd select source_document_id, max(gl_date) gl_date FROM ods_rbs.ods_rbs_qwe_cdc_ro cfad where cfad._flink_cdc_op != 'd' group by source_document_id;" still getting an error:
spark-sql>
24/08/02 16:40:26 WARN TaskSetManager: Lost task 701.0 in stage 64.0 (TID 12962) (crpprd6hd25 executor 39): java.io.FileNotFoundException: File does not exist: hdfs://cmccprdhadoop/apps/hive/warehouse/hudi.db/hudi_qwe_cdc/partition_001/00000609-2108-40f4-b999-faebd67f0706_1-8-2_20240801113831765.parquet at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1757) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1750) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1765) at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39) at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.footerFileMetaData$lzycompute$1(Spark33LegacyHoodieParquetFileFormat.scala:172) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.footerFileMetaData$1(Spark33LegacyHoodieParquetFileFormat.scala:171) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(Spark33LegacyHoodieParquetFileFormat.scala:175) at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:68) at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:586) at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:677) at org.apache.hudi.RecordMergingFileIterator.(Iterators.scala:250)
at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
24/08/02 16:40:26 ERROR TaskSetManager: Task 695 in stage 64.0 failed 4 times; aborting job org.apache.spark.SparkException: Job aborted due to stage failure: Task 695 in stage 64.0 failed 4 times, most recent failure: Lost task 695.3 in stage 64.0 (TID 12963) (crpprd6hd25 executor 39): java.io.FileNotFoundException: File does not exist: hdfs://cmccprdhadoop/apps/hive/warehouse/hudi.db/hudi_qwe_cdc/partition_001/00000186-a762-4e67-823c-00cde3350e13_2-8-2_20240801181045754.parquet at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1757) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1750) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1765) at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39) at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.footerFileMetaData$lzycompute$1(Spark33LegacyHoodieParquetFileFormat.scala:172) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.footerFileMetaData$1(Spark33LegacyHoodieParquetFileFormat.scala:171) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(Spark33LegacyHoodieParquetFileFormat.scala:175) at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:68) at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:586) at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:677) at org.apache.hudi.RecordMergingFileIterator.(Iterators.scala:250)
at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) Caused by: java.io.FileNotFoundException: File does not exist: hdfs://cmccprdhadoop/apps/hive/warehouse/hudi.db/hudi_qwe_cdc/partition_001/00000186-a762-4e67-823c-00cde3350e13_2-8-2_20240801181045754.parquet at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1757) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1750) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1765) at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39) at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.footerFileMetaData$lzycompute$1(Spark33LegacyHoodieParquetFileFormat.scala:172) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.footerFileMetaData$1(Spark33LegacyHoodieParquetFileFormat.scala:171) at org.apache.spark.sql.execution.datasources.parquet.Spark33LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(Spark33LegacyHoodieParquetFileFormat.scala:175) at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:68) at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:586) at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:677) at org.apache.hudi.RecordMergingFileIterator.(Iterators.scala:250)
at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
To Reproduce
Steps to reproduce the behavior:
1. 2. 3. 4.
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version : 0.15.0
Spark version : 3.3.2
Hive version : 3.1.3
Hadoop version :
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.