apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0

[Bug]: clean-orphan-file mistakenly deleting data files #2925

Closed · lintingbin closed this issue 3 weeks ago

lintingbin commented 3 weeks ago

What happened?

When creating an Iceberg table on OSS and executing clean-orphan-file, normal data files that have just been written but not yet committed to Iceberg are mistakenly cleaned up. The clean-orphan-file.min-existing-time-minutes parameter does not take effect.
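The intended behavior can be sketched as a simple time guard: a candidate orphan file is only safe to delete once its last-modified time is older than the configured grace period, so freshly written but not-yet-committed files survive. This is a minimal illustration of the check that the bug report says is not taking effect; OrphanFileGuard and safeToDelete are hypothetical names, not Amoro's actual code.

```java
import java.time.Duration;
import java.time.Instant;

public class OrphanFileGuard {
  // Returns true only when the file's last-modified time is older than
  // now - minExistingMinutes, i.e. it is outside the grace window and
  // cannot be a freshly written, not-yet-committed data file.
  static boolean safeToDelete(Instant lastModified, Instant now, long minExistingMinutes) {
    return lastModified.isBefore(now.minus(Duration.ofMinutes(minExistingMinutes)));
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2024-06-12T12:00:00Z");
    // Written 5 minutes ago, 30-minute grace window: must be kept.
    System.out.println(safeToDelete(now.minus(Duration.ofMinutes(5)), now, 30));  // false
    // Written 2 hours ago, past the grace window: eligible for deletion.
    System.out.println(safeToDelete(now.minus(Duration.ofHours(2)), now, 30));    // true
  }
}
```

If this guard were applied correctly, the NotFoundException in the log below (an optimizer task reading a data file that the cleaner already deleted) should not occur for recently written files.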

Affects Versions

master

What table format are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

AMS

How to reproduce

Create an Iceberg table using OSS, then set clean-orphan-file.enabled to true.
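The reproduction hinges on two table properties (property names as used in this report; the default value shown for the grace period is an assumption and may differ by Amoro version):

```
clean-orphan-file.enabled = true
# Grace period before an untracked file is treated as an orphan;
# the report says this setting is not honored.
clean-orphan-file.min-existing-time-minutes = 2880
```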

Relevant log output

org.apache.iceberg.exceptions.NotFoundException: File does not exist: oss://xxxxxx/user/hive/warehouse/dev_game_ods.db/xxxxxx_log/data/event_time.string_trunc=2024-06-12/log_type.string=xxxxxxxxxx/00000-0-9aa7ee01-e2ca-46bd-8884-e144c8d1528b-45994.parquet
    at org.apache.iceberg.hadoop.HadoopInputFile.lazyStat(HadoopInputFile.java:164)
    at org.apache.iceberg.hadoop.HadoopInputFile.getStat(HadoopInputFile.java:200)
    at org.apache.iceberg.parquet.ParquetIO.file(ParquetIO.java:51)
    at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:238)
    at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:81)
    at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:71)
    at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:91)
    at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.hasNext(CloseableIterable.java:257)
    at org.apache.iceberg.io.CloseableIterable$7$1.hasNext(CloseableIterable.java:197)
    at org.apache.iceberg.io.CloseableIterable$7$1.hasNext(CloseableIterable.java:197)
    at org.apache.amoro.optimizing.AbstractRewriteFilesExecutor.rewriterDataFiles(AbstractRewriteFilesExecutor.java:150)
    at org.apache.amoro.table.TableMetaStore.call(TableMetaStore.java:234)
    at org.apache.amoro.table.TableMetaStore.lambda$doAs$0(TableMetaStore.java:209)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1855)
    at org.apache.amoro.table.TableMetaStore.doAs(TableMetaStore.java:209)
    at org.apache.amoro.io.AuthenticatedHadoopFileIO.doAs(AuthenticatedHadoopFileIO.java:202)
    at org.apache.amoro.optimizing.AbstractRewriteFilesExecutor.execute(AbstractRewriteFilesExecutor.java:108)
    at org.apache.amoro.optimizing.AbstractRewriteFilesExecutor.execute(AbstractRewriteFilesExecutor.java:64)
    at org.apache.amoro.optimizer.common.OptimizerExecutor.executeTask(OptimizerExecutor.java:149)
    at org.apache.amoro.optimizer.spark.SparkOptimizingTaskFunction.call(SparkOptimizingTaskFunction.java:45)
    at org.apache.amoro.optimizer.spark.SparkOptimizingTaskFunction.call(SparkOptimizingTaskFunction.java:33)
    at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)

Anything else

No response

Are you willing to submit a PR?


lintingbin commented 3 weeks ago

Fixed by https://github.com/apache/amoro/pull/2924