apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[core] Optimize multiple commit to reduce conflicts #4286

Closed JingsongLi closed 1 month ago

JingsongLi commented 1 month ago

Purpose

At present, when a table has a very large number of files, the commit interval is relatively short, and multiple jobs write to it simultaneously, the commits compete heavily and can even fail outright (a commit fails after more than ten retries).

This is because a commit may currently trigger data file conflict checking, which takes a relatively long time because it has to read the data files from the old snapshot. If other jobs commit during that time, the retried commit may still fail, because the conflict check is repeated on every retry.

This is very wasteful. We can instead reuse the base files: on a retry, read only the incremental files and merge them into the base files from the previous attempt.
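
To illustrate the idea only, here is a minimal Python sketch of reusing base files across retries. It is not Paimon's implementation: `FileEntry`, `merge_entries`, `BaseFileCache`, and the `read_full` / `read_delta` callbacks are hypothetical names, and it assumes snapshot ids are consecutive integers and that each snapshot's incremental changes can be read on their own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical manifest entry: a data file added or deleted by a commit."""
    kind: str  # "ADD" or "DELETE"
    name: str


def merge_entries(base: Dict[str, FileEntry],
                  delta: Iterable[FileEntry]) -> Dict[str, FileEntry]:
    """Apply incremental entries on top of a base file set, returning a new dict."""
    merged = dict(base)
    for entry in delta:
        if entry.kind == "ADD":
            merged[entry.name] = entry
        elif entry.kind == "DELETE":
            merged.pop(entry.name, None)
        else:
            raise ValueError(f"unknown entry kind: {entry.kind}")
    return merged


class BaseFileCache:
    """Keeps the file set of the last checked snapshot so a retry reads only deltas."""

    def __init__(self,
                 read_full: Callable[[int], List[FileEntry]],
                 read_delta: Callable[[int], List[FileEntry]]) -> None:
        self._read_full = read_full      # expensive: all files of a snapshot
        self._read_delta = read_delta    # cheap: changes made by one snapshot
        self._snapshot_id: Optional[int] = None
        self._files: Dict[str, FileEntry] = {}
        self.full_reads = 0              # counts expensive reads, for illustration

    def files_at(self, snapshot_id: int) -> Dict[str, FileEntry]:
        if self._snapshot_id is None or snapshot_id < self._snapshot_id:
            # No usable base yet: fall back to a full read.
            self._files = merge_entries({}, self._read_full(snapshot_id))
            self.full_reads += 1
        else:
            # Reuse the cached base and merge only the newer snapshots' changes.
            for sid in range(self._snapshot_id + 1, snapshot_id + 1):
                self._files = merge_entries(self._files, self._read_delta(sid))
        self._snapshot_id = snapshot_id
        return self._files
```

In this sketch the first conflict check pays for one full read. If another job commits in the meantime, the retry calls `files_at` with the newer snapshot id and reads only the entries added since the cached snapshot, so the cost of a retry scales with the size of the increment rather than with the total number of files.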

Tests

API and Format

Documentation