apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0

[AMORO-2925] Fix the issue of clean-orphan-file mistakenly deleting data files. #2924

Closed lintingbin closed 3 weeks ago

lintingbin commented 3 weeks ago

Why are the changes needed?

When an Iceberg table is created on OSS and clean-orphan-file runs, valid data files (just written but not yet committed to Iceberg) are deleted. The clean-orphan-file.min-existing-time-minutes parameter does not take effect.

Debugging showed that OSS does not record file access time, so getAccessTime returns 0. Every file therefore looks older than any cutoff, which makes the clean-orphan-file.min-existing-time-minutes parameter ineffective.

Iceberg's listPrefix function also uses getModificationTime, so getModificationTime should be the better choice here.
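
To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use Amoro or Hadoop code: `FileInfo`, `cutoff`, `expiredByAccessTime`, and `expiredByModificationTime` are hypothetical names that only model the two timestamps, and the 2880-minute threshold is an illustrative value. It shows how an access time of 0 makes every file pass the age check, while the modification time keeps a freshly written file safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class OrphanFileAgeCheck {

    /** Hypothetical stand-in for a listed file; timestamps are epoch milliseconds. */
    static final class FileInfo {
        final String path;
        final long accessTime;        // 0 when the store does not record access time
        final long modificationTime;

        FileInfo(String path, long accessTime, long modificationTime) {
            this.path = path;
            this.accessTime = accessTime;
            this.modificationTime = modificationTime;
        }
    }

    /** Files whose timestamp is earlier than this instant count as old enough to delete. */
    static long cutoff(long nowMillis, long minExistingMinutes) {
        return nowMillis - TimeUnit.MINUTES.toMillis(minExistingMinutes);
    }

    /** Age check on access time: a recorded value of 0 is always before the cutoff. */
    static List<String> expiredByAccessTime(List<FileInfo> files, long cutoffMillis) {
        List<String> expired = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.accessTime < cutoffMillis) {
                expired.add(f.path);
            }
        }
        return expired;
    }

    /** Age check on modification time: only files written before the cutoff qualify. */
    static List<String> expiredByModificationTime(List<FileInfo> files, long cutoffMillis) {
        List<String> expired = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.modificationTime < cutoffMillis) {
                expired.add(f.path);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long cutoffMillis = cutoff(now, 2880); // illustrative threshold, in minutes

        // Both files report accessTime = 0, as described for OSS above.
        List<FileInfo> files = Arrays.asList(
            new FileInfo("data/just-written.parquet", 0L, now - TimeUnit.MINUTES.toMillis(1)),
            new FileInfo("data/week-old.parquet", 0L, now - TimeUnit.DAYS.toMillis(7)));

        System.out.println("by access time:       " + expiredByAccessTime(files, cutoffMillis));
        System.out.println("by modification time: " + expiredByModificationTime(files, cutoffMillis));
    }
}
```

With the access-time check, the file written one minute ago is flagged together with the week-old one. With the modification-time check, only the week-old file is flagged.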

Brief change log

How was this patch tested?

Documentation

zhoujinsong commented 3 weeks ago

@lintingbin Thanks for reporting this bug and trying to fix it.

I suggest you create an issue and link this PR to it. It will help a lot when others try to search for the same issues.

lintingbin commented 3 weeks ago

@zhoujinsong The issue has been added and linked to this PR.

zhoujinsong commented 3 weeks ago

> @zhoujinsong The issue has been added and linked to this PR.

Thanks for that. We can add a prefix [AMORO-${issue number}] to the PR's title (I have added it for you this time).