apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0
2.12k stars, 839 forks

[Feature] Generate changelog file by copying data file when they are equal #3567

Open yunfengzhou-hub opened 2 weeks ago

yunfengzhou-hub commented 2 weeks ago

Motivation

In some cases, the changelog file and the data file of a bucket have identical content. For example, this happens when a Paimon job synchronizes a full snapshot of a database into Paimon (so no two records share a primary key and no merge is performed) and the job uses input as the changelog producer. In such cases, instead of writing the same content twice, we can generate the changelog file by copying the data file, or vice versa. This optimization can reduce the IO overhead of Paimon sinks and improve the throughput of the affected jobs.
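To make the idea concrete, here is a minimal toy sketch in Python. It is not Paimon code: the `flush` and `serialize` helpers, the tab-separated text encoding, and the "last write per key wins" merge are all assumptions made for illustration. It only shows the proposed shortcut: when no merge occurred, the changelog file is produced by copying the data file instead of serializing the records a second time.

```python
import shutil
import tempfile
from pathlib import Path


def serialize(records):
    # Toy encoding (assumption): one "key<TAB>value" line per record.
    return "".join(f"{key}\t{value}\n" for key, value in records)


def flush(records, data_path, changelog_path):
    """Write a data file and a changelog file for one batch of records.

    Returns True if the changelog was produced by copying the data file.
    Hypothetical helper for illustration only.
    """
    # Order records by key; the sort is stable, so input order is kept per key.
    ordered = sorted(records, key=lambda record: record[0])

    # Toy merge (assumption): the last write per key wins.
    merged = {}
    for key, value in ordered:
        merged[key] = value
    data_path.write_text(serialize(merged.items()))

    if len(merged) == len(ordered):
        # No two records shared a key, so nothing was merged and the
        # changelog content equals the data file content: copy it.
        shutil.copyfile(data_path, changelog_path)
        return True

    # Some records were merged, so the changelog must be written separately.
    changelog_path.write_text(serialize(ordered))
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data, changelog = Path(tmp) / "data-0", Path(tmp) / "changelog-0"
        copied = flush([(2, "b"), (1, "a")], data, changelog)
        print(copied, data.read_bytes() == changelog.read_bytes())
```

In this sketch the copy path skips the second serialization pass. Whether a real implementation would use a file copy or some other way of sharing the bytes is left open here.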

Solution

No response

Anything else?

No response