apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0
2.27k stars 908 forks source link

[Feature] Support branch in paimon #1795

Open FangYongs opened 1 year ago

FangYongs commented 1 year ago

Search before asking

Motivation

Support branch for tables in paimon. In data streaming process there may be data errors and other issues, and we need to correct the data in the flow. This situation is very common and important. However, in this process, we do not want to affect existing data processing to avoid impact on users, we need to create a new data streaming process and wait for it to catch up with the data and replace the original data streaming process. The main operations can be divided into the following steps:

  1. Create a replica table based on the specified tag/snapshot of upstream and downstream Paimon Tables
  2. Resubmit all streaming jobs, incremental or full recovery starting from the specified offset

After discussed with @SteNicholas , we think we need to support branch in Paimon. Then we could create replica tables to avoid coping all data from specified table and increase storage space. cc @JingsongLi

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

rakesh-das08 commented 1 year ago

@FangYongs So this would be quite similar to what we have in iceberg right? A branch identifier or WAP design.

FangYongs commented 1 year ago

@rakesh-das08 Yes, it will be very useful for streaming-batch data processing and data correction in streaming process

SteNicholas commented 1 year ago

@FangYongs, thanks for driving this significant feature. I give some knowledge from internal practice.

Branch feature have the following typical application scenarios:

In business production scenarios, in order to meet the user's branch needs, more consideration should be given to ease of use and usability. For example, how does the user know that a snapshot has been published correctly? One of the problems involved is visibility. That is to say, users should be able to explicitly get the snapshot table in the entire pipeline. In addition, in the snapshot scenario, a common requirement is the precise segmentation of data. An example is that users actually do not want the data of event time on the 1st to drift to the snapshot of the 2nd, and the more hopeful method is to combine the watermark under each manifest to do fine snapshot segmentation.

In order to better meet the needs in the production environment, we have implemented the following optimizations:

Branch feature is aimed to solve the problem of real-time data entry into the lakehouse, and only supports incremental partitioning and waste of full partition storage. Meanwhile, the support of branch could bring the following benefits:

cc @JingsongLi

JingsongLi commented 1 year ago

Offline discussed with @FangYongs @SteNicholas @Alibaba-HZY

The role of Branch:

  1. Data correction, the second streaming pipeline can be started in parallel for data correction, and the main branch can be covered after catching up. As described in this issue.
  2. Enhancing Tag, for Tag simulation of traditional Hive partition tables, provide data correction capabilities on the basis of Tag, which can be used to supplement data and achieve precise segmentation capabilities.

I am +1 for this. @FangYongs can start a PIP for this.

FangYongs commented 1 year ago

@JingsongLi @SteNicholas @Alibaba-HZY I have created a PIP for this issue, please help to review the doc https://cwiki.apache.org/confluence/display/PAIMON/PIP-9%3A+Support+Branch