apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[flink] PartitionMarkDone enables the use of various partition trigge… #4386

Closed Aitozi closed 3 weeks ago

Aitozi commented 3 weeks ago

…r strategies.

Purpose

Linked issue: close #xxx

In our company, we have encountered an issue where the HMS partition statistics are incorrect. This happens because, during writing, we only update the metastore partition when it is first written to. We would therefore like to apply the PartitionMarkDone strategy to update the statistics in HMS after a short idle period for each partition. We need a separate configuration for PartitionMarkDone due to differing requirements:

Tests

API and Format

Documentation

Aitozi commented 3 weeks ago

WDYT? @JingsongLi

JingsongLi commented 3 weeks ago

Hi @Aitozi, can you explain this PR at the API level? Or are you modifying the current behavior?

Aitozi commented 3 weeks ago

@JingsongLi The behavior does not change. In this PR, I extract the PartitionTrigger interface; the default implementation is PartitionMarkDoneTrigger.

Within StoreCommitter, it acts as the PartitionCollector and can trigger based on different configs, such as partition-mark-done or hms-report.
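
To make the idea above concrete, here is a minimal sketch of how a trigger interface with per-action idle times might look. It is not the code from this PR: the method names (`notifyPartition`, `donePartitions`), the `IdleTimeTrigger` class, the shape of `PartitionCollector`, and the idle-time values are all assumptions chosen for illustration. Only the names PartitionTrigger, PartitionCollector, partition-mark-done, and hms-report come from the discussion.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: the signatures are assumptions, not Paimon's actual API.
interface PartitionTrigger {
    // Record that a partition was written to at the given time.
    void notifyPartition(String partition, long nowMillis);

    // Return (and forget) the partitions considered done at the given time.
    List<String> donePartitions(long nowMillis);
}

// Hypothetical strategy: a partition is done once it has been idle long enough.
final class IdleTimeTrigger implements PartitionTrigger {
    private final long idleMillis;
    private final Map<String, Long> lastWrite = new LinkedHashMap<>();

    IdleTimeTrigger(long idleMillis) {
        this.idleMillis = idleMillis;
    }

    @Override
    public void notifyPartition(String partition, long nowMillis) {
        lastWrite.put(partition, nowMillis);
    }

    @Override
    public List<String> donePartitions(long nowMillis) {
        List<String> done = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastWrite.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= idleMillis) {
                done.add(entry.getKey());
                it.remove(); // each partition is reported once per idle period
            }
        }
        return done;
    }
}

// Hypothetical collector: one trigger per action, each with its own configuration.
final class PartitionCollector {
    private final Map<String, PartitionTrigger> triggers = new LinkedHashMap<>();

    void register(String action, PartitionTrigger trigger) {
        triggers.put(action, trigger);
    }

    // Called on commit with the partitions that were written.
    void onCommit(List<String> partitions, long nowMillis) {
        for (PartitionTrigger trigger : triggers.values()) {
            for (String partition : partitions) {
                trigger.notifyPartition(partition, nowMillis);
            }
        }
    }

    // Returns, per action, the partitions that should be acted on now.
    Map<String, List<String>> fire(long nowMillis) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, PartitionTrigger> entry : triggers.entrySet()) {
            result.put(entry.getKey(), entry.getValue().donePartitions(nowMillis));
        }
        return result;
    }
}
```

In this sketch, registering an `IdleTimeTrigger` with a short idle time under `hms-report` and a longer one under `partition-mark-done` lets the statistics report fire soon after a partition goes quiet, while the done mark waits longer. Keeping one trigger instance per action is what allows the two configurations to differ.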