[Feature] Support branch in paimon

FangYongs commented 1 year ago

Search before asking

[X] I searched in the issues and found nothing similar.

Motivation

Support branches for tables in Paimon. In a data streaming process there may be data errors and other issues, and we need to correct the data in the flow. This situation is common and important. However, we do not want the correction to affect existing data processing or impact users, so we need to create a new data streaming process, wait for it to catch up with the data, and then replace the original data streaming process with it. The main operations can be divided into the following steps:

Create a replica table based on the specified tag/snapshot of upstream and downstream Paimon Tables
Resubmit all streaming jobs, with incremental or full recovery starting from the specified offset

After discussing with @SteNicholas, we think we need to support branches in Paimon. Then we could create replica tables without copying all data from the specified table and increasing storage space. cc @JingsongLi
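To make the storage argument concrete, here is a minimal sketch in Python. It is a toy model and not Paimon's API: `ToyTable`, `commit`, `create_tag`, `create_branch`, and `stored_files` are hypothetical names, and a snapshot is assumed to be nothing more than a tuple of immutable data file names. Creating a branch from a tag copies only that tuple of references, so the replica shares the existing files with the original table.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model (not Paimon's API): a snapshot is a tuple of immutable file names."""
    branches: dict = field(default_factory=lambda: {"main": ()})
    tags: dict = field(default_factory=dict)

    def commit(self, branch, new_files):
        # A commit appends file references; existing files are never rewritten.
        self.branches[branch] = self.branches[branch] + tuple(new_files)

    def create_tag(self, name, branch="main"):
        # A tag freezes the current file list of a branch.
        self.tags[name] = self.branches[branch]

    def create_branch(self, name, from_tag):
        # Only metadata is copied: the branch starts out referencing the tag's files.
        self.branches[name] = self.tags[from_tag]

    def stored_files(self):
        # Physical storage is the union of files referenced by any branch or tag.
        refs = list(self.branches.values()) + list(self.tags.values())
        return {f for snapshot in refs for f in snapshot}


table = ToyTable()
table.commit("main", ["f1", "f2"])
table.create_tag("t1")
table.create_branch("fix", from_tag="t1")
table.commit("main", ["f3"])            # the original pipeline keeps running
table.commit("fix", ["f3-corrected"])   # the correction pipeline writes to the branch
```

In this model the files `f1` and `f2` are stored once even though both `main` and `fix` read them, which is the saving the issue is after.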

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

[ ] I'm willing to submit a PR!

rakesh-das08 commented 1 year ago

@FangYongs So this would be quite similar to what we have in Iceberg, right? A branch identifier or WAP design.

FangYongs commented 1 year ago

@rakesh-das08 Yes, it will be very useful for streaming-batch data processing and for data correction in streaming processes.

SteNicholas commented 1 year ago

@FangYongs, thanks for driving this significant feature. Here is some experience from our internal practice.

The branch feature has the following typical application scenarios:

A snapshot named compacted-YYYYMMDD is generated on the base table every day, and users use the snapshot table to generate daily derived data tables and calculate report data. When a user's downstream calculation logic changes, the corresponding snapshot can be selected for recalculation. You can also set the retention period to X days and clear out expired data every day. In fact, SCD-2 can also be implemented naturally on multi-snapshot data.
An archive branch named yyyy-archived can be generated every year after the data is compressed and optimized. If our storage strategy changes (such as deleting sensitive information), we can generate a new snapshot on this branch after performing the related operations.
A snapshot named preprod-xx can be officially released to users after necessary quality checks, avoiding the coupling of external tools and the pipeline itself.

In business production scenarios, meeting users' branch needs requires more attention to ease of use and usability. For example, how does the user know that a snapshot has been published correctly? One of the problems involved is visibility: users should be able to explicitly get the snapshot table anywhere in the pipeline. In addition, a common requirement in the snapshot scenario is precise segmentation of data. For example, users do not want data whose event time falls on the 1st to drift into the snapshot of the 2nd, and the preferred approach is to use the watermark under each manifest to do fine-grained snapshot segmentation.
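The segmentation requirement can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it works on plain in-memory records rather than manifests, and `day_is_complete` and `records_for_day` are hypothetical helper names, not part of Paimon. The idea is that a day's snapshot is cut only once the watermark has reached the next midnight, and records are assigned to a day by event time rather than by arrival order.

```python
from datetime import date, datetime, timedelta


def day_is_complete(day, watermark):
    """A day's snapshot can be cut once the watermark has reached the next midnight."""
    next_midnight = datetime.combine(day + timedelta(days=1), datetime.min.time())
    return watermark >= next_midnight


def records_for_day(records, day):
    """Select records by event time, regardless of when they arrived."""
    return [r for r in records if r["event_time"].date() == day]


records = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 23, 59)},
    {"id": 2, "event_time": datetime(2024, 1, 2, 0, 1)},
    {"id": 3, "event_time": datetime(2024, 1, 1, 23, 58)},  # arrived after record 2
]

too_early = day_is_complete(date(2024, 1, 1), datetime(2024, 1, 1, 23, 59, 30))
ready = day_is_complete(date(2024, 1, 1), datetime(2024, 1, 2, 0, 0))
first_day_ids = [r["id"] for r in records_for_day(records, date(2024, 1, 1))]
```

Here record 3 arrives after a record from the 2nd but still lands in the snapshot for the 1st, because the cut is driven by the watermark and event time.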

In order to better meet the needs in the production environment, we have implemented the following optimizations:

Provide snapshot branch and lifecycle management features.
On the MergeOnRead table, we can provide data cut off precisely at 00:00 for downstream offline processing on top of streaming writes.

The branch feature aims to solve two problems of real-time data ingestion into the lakehouse: only incremental partitions are supported, and full partitions waste storage. Meanwhile, branch support could bring the following benefits:

Reduce storage costs: only one copy of the data is stored.
Improve efficiency: the end-to-end pipeline is simple, and the cost of engineering and productization is low.
Risk-free access through time travel and branch management.
Partition segmentation is accurate and requires no additional filter conditions from users.
A single table provides incremental partitions, full partitions, and real-time partitions. The SQL remains unchanged; only the hint/option changes.

cc @JingsongLi

JingsongLi commented 1 year ago

Discussed offline with @FangYongs, @SteNicholas, and @Alibaba-HZY.

The role of Branch:

Data correction: a second streaming pipeline can be started in parallel for data correction, and it can replace the main branch after catching up, as described in this issue.
Enhancing Tag: where Tags are used to simulate traditional Hive partitioned tables, branches provide data correction capabilities on top of Tags, which can be used to backfill data and achieve precise segmentation.
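The data correction role can be sketched as follows. This is only an illustration in Python under assumptions the thread does not spell out: "caught up" is judged by comparing a consumed source offset per branch, and `replace_main_if_caught_up`, `heads`, and `offsets` are hypothetical names rather than anything in Paimon.

```python
def replace_main_if_caught_up(heads, offsets, branch):
    """Point main at the correction branch once the branch has caught up.

    heads:   branch name -> id of its latest snapshot (hypothetical ids)
    offsets: branch name -> source offset the branch has consumed so far
    """
    if offsets[branch] < offsets["main"]:
        return False  # still behind; keep serving the original main branch
    heads["main"] = heads[branch]
    offsets["main"] = offsets[branch]
    return True


heads = {"main": "snapshot-12", "fix": "snapshot-fix-7"}
offsets = {"main": 1000, "fix": 800}

first_try = replace_main_if_caught_up(heads, offsets, "fix")   # branch is still behind
offsets["fix"] = 1000                                          # correction pipeline catches up
second_try = replace_main_if_caught_up(heads, offsets, "fix")
```

Readers of `main` keep seeing the original data until the switch, which is a single metadata update in this model.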

I am +1 for this. @FangYongs can start a PIP for this.

FangYongs commented 1 year ago

@JingsongLi @SteNicholas @Alibaba-HZY I have created a PIP for this issue, please help to review the doc https://cwiki.apache.org/confluence/display/PAIMON/PIP-9%3A+Support+Branch
