apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[spark][flink] Introduce orphan file cleaning local and distributed mode #4285

Closed bknbkn closed 1 month ago

bknbkn commented 1 month ago

Purpose

For some small tables, running orphan file cleaning directly in distributed mode wastes a lot of resources. Users should be able to choose the cleaning mode (local or distributed).

Tests

Added a unit test in RemoveOrphanFilesProcedureTest: "Paimon procedure: remove orphan files with mode".

API and Format

New usage, with the mode argument:

CALL sys.remove_orphan_files(table => 'default.T', older_than => '2023-10-31 12:00:00', dry_run => true, parallelism => '5', mode => 'local')
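
As a rough sketch of how the two modes might be used together, the calls below first preview the cleanup in local mode and then run the same cleanup in distributed mode. The value 'distributed' is an assumption inferred from the "(local or distributed)" wording in the Purpose section; the table name and timestamp are copied from the example above, and which mode applies when the argument is omitted is not stated in this PR.

```sql
-- Preview only (dry_run => true): report the orphan files of a small table
-- using local mode, so no distributed job is needed.
CALL sys.remove_orphan_files(table => 'default.T', older_than => '2023-10-31 12:00:00', dry_run => true, mode => 'local');

-- Assumed counterpart for a larger table: the same cleanup in distributed
-- mode, with parallelism set as in the original example.
CALL sys.remove_orphan_files(table => 'default.T', older_than => '2023-10-31 12:00:00', dry_run => false, parallelism => '5', mode => 'distributed');
```

Running with dry_run => true first is a cautious choice: the files that would be deleted can be inspected before anything is actually removed.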

JingsongLi commented 1 month ago

Re-open to trigger test.

bknbkn commented 1 month ago

CI has passed. Could you review it again? Thanks, @JingsongLi.

JingsongLi commented 1 month ago

+1