apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Support minor compact for dedicated compaction. #4566

Open LinMingQiang opened 15 hours ago

LinMingQiang commented 15 hours ago

Motivation

Why we need this.

Currently, the compact action performs a full compaction in batch mode: it merges all base files with the delta files and generates a new base file. After that, we have two copies of the full data in storage (base_file1 + delta_file1 + base_file2).

However, sometimes we only need to merge the incremental data, and we are willing to accept some reduction in read performance in exchange for lower storage usage.

Solution

This will be implemented through 3 PRs:

Step 1: Refactor the compact action so that it can be extended with additional compaction types.

Step 2: Let the compact action use full_compaction to decide which compaction is triggered: FullCompaction or UniversalCompaction.

Step 3: Add a new procedure universal_compact for Spark and Flink.
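
To illustrate step 2, here is a minimal sketch of how the action could pick a compaction type from its options. It is not the proposed implementation: the class `CompactDispatchSketch`, the enum `CompactType`, and the method `chooseCompactType` are hypothetical names, and treating a missing `full_compaction` option as `true` (to keep today's behavior) is an assumption.

```java
import java.util.Map;

// Hypothetical sketch: none of these names are taken from the Paimon code base.
final class CompactDispatchSketch {

    /** The two compaction strategies discussed in this issue. */
    enum CompactType { FULL, UNIVERSAL }

    /**
     * Picks a strategy from the action's options.
     * Assumption: a missing "full_compaction" key keeps the current
     * behavior, i.e. a full compaction.
     */
    static CompactType chooseCompactType(Map<String, String> options) {
        String raw = options.getOrDefault("full_compaction", "true");
        return Boolean.parseBoolean(raw) ? CompactType.FULL : CompactType.UNIVERSAL;
    }

    public static void main(String[] args) {
        // No option set: falls back to the assumed default.
        System.out.println(chooseCompactType(Map.of()));
        // Explicitly disabling full compaction selects the universal strategy.
        System.out.println(chooseCompactType(Map.of("full_compaction", "false")));
    }
}
```

Keeping the decision in one small function would let the refactoring from step 1 add further compaction types later without touching the callers.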

Anything else?

No response

Are you willing to submit a PR?