apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[spark] Support distributed orphan file clean for spark #4200

Closed by xuzifu666 2 months ago

xuzifu666 commented 2 months ago

Purpose

Linked issue: close https://github.com/apache/paimon/issues/4184

Tests

RemoveOrphanFilesProcedureTest

API and Format

Documentation

xuzifu666 commented 2 months ago

The Dataset code style in https://github.com/apache/paimon/pull/4207 is more graceful than the RDD style, so I want to use the #4207 version to resolve this issue. @ulysses-you @JingsongLi