apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Feature] [Stream Load] Two-Phase Commit for stream load #7141

Closed weizuo93 closed 2 years ago

weizuo93 commented 2 years ago

Background

In the sample Doris application, data flow is as follows:

Flink generates checkpoints at a regular, configurable interval and writes each checkpoint to a persistent storage system, such as HDFS. A checkpoint in Flink is a consistent snapshot of:

[Screenshot, 2021-11-17 20-43-16]

In the event of a machine or Flink software failure, the Flink application restarts and resumes processing from the most recent successfully completed checkpoint. As a result, some data is loaded into Doris twice, producing duplicates.

To provide exactly-once semantics, Doris must provide a way to commit or roll back loads in coordination with Flink's checkpoints. So it would be better to support two-phase commit (2PC) for stream load.

For the data sink to provide exactly-once guarantees, it must:

In the event of a machine or Flink software failure, upon restart, commit all stream load tasks between the two most recent checkpoints. (It is OK to execute commit repeatedly for the same stream load task.)
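To make the recovery rule concrete, here is a minimal in-memory sketch, not the Doris or Flink API. `FakeLoadTarget` and `TwoPhaseSink` are hypothetical names; the method names only loosely echo Flink's checkpoint hooks. It assumes the sink stores the IDs of pre-committed transactions in its checkpointed state and that commit is idempotent on the target side.

```python
class FakeLoadTarget:
    """In-memory stand-in for the load target (hypothetical, not the Doris API)."""

    def __init__(self):
        self.pre_committed = {}  # txn_id -> rows staged but not yet visible
        self.visible = []        # rows visible to queries
        self.committed = set()   # txn_ids that have already been committed

    def pre_commit(self, txn_id, rows):
        self.pre_committed[txn_id] = list(rows)

    def commit(self, txn_id):
        # Idempotent: committing an already committed transaction is a no-op.
        if txn_id in self.committed:
            return
        self.visible.extend(self.pre_committed.pop(txn_id))
        self.committed.add(txn_id)


class TwoPhaseSink:
    """Hypothetical sink that ties load transactions to checkpoints."""

    def __init__(self, target):
        self.target = target
        self.buffer = []    # rows received since the last checkpoint
        self.next_txn = 0
        self.pending = []   # pre-committed txn ids, stored in checkpoint state

    def write(self, row):
        self.buffer.append(row)

    def snapshot_state(self):
        # Phase 1: when a checkpoint is taken, pre-commit the buffered rows.
        txn_id = self.next_txn
        self.next_txn += 1
        self.target.pre_commit(txn_id, self.buffer)
        self.buffer = []
        self.pending.append(txn_id)
        return {"pending": list(self.pending), "next_txn": self.next_txn}

    def notify_checkpoint_complete(self):
        # Phase 2: the checkpoint succeeded, so make the data visible.
        for txn_id in self.pending:
            self.target.commit(txn_id)
        self.pending = []

    @classmethod
    def restore(cls, target, state):
        # After a failure, re-commit everything recorded in the last checkpoint.
        # Some of these may already be committed; idempotency makes that safe.
        sink = cls(target)
        sink.next_txn = state["next_txn"]
        for txn_id in state["pending"]:
            target.commit(txn_id)
        return sink
```

In this sketch, rows buffered after the last checkpoint are lost on failure and never pre-committed, so replaying them from the source does not create duplicates.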

Reference: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html

Design

The design of two-phase commit for stream load is as follows:

[Screenshot, 2021-11-02 15-55-26]

Once the pre-commit is complete, we must ensure that the commit can succeed. Of course, data related to an expired transaction that has been pre-committed could be removed.
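One way to picture this lifecycle is as a small state machine. The sketch below is only an illustration under assumptions: the state names, the `Txn` class, the timeout field, and `expire_pre_committed` are hypothetical and are not taken from the Doris code base.

```python
import enum


class TxnState(enum.Enum):
    PREPARED = "prepared"
    PRE_COMMITTED = "pre_committed"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Txn:
    """Hypothetical load transaction with a pre-commit phase and a deadline."""

    def __init__(self, txn_id, now, timeout_s):
        self.txn_id = txn_id
        self.state = TxnState.PREPARED
        self.deadline = now + timeout_s  # assumed expiry time for cleanup

    def pre_commit(self):
        if self.state is not TxnState.PREPARED:
            raise ValueError(f"cannot pre-commit from {self.state}")
        self.state = TxnState.PRE_COMMITTED

    def commit(self):
        if self.state is TxnState.COMMITTED:
            return  # repeated commit is allowed and does nothing
        if self.state is not TxnState.PRE_COMMITTED:
            raise ValueError(f"cannot commit from {self.state}")
        self.state = TxnState.COMMITTED

    def abort(self):
        if self.state is TxnState.COMMITTED:
            raise ValueError("cannot abort a committed transaction")
        self.state = TxnState.ABORTED


def expire_pre_committed(txns, now):
    """Abort pre-committed transactions past their deadline; return their ids."""
    expired = []
    for txn in txns:
        if txn.state is TxnState.PRE_COMMITTED and now >= txn.deadline:
            txn.abort()
            expired.append(txn.txn_id)
    return expired
```

The point of the sketch is that a pre-committed transaction has only two exits: a commit that must succeed, or a cleanup once it has expired without being committed.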

Are you willing to submit PR?

Code of Conduct

morningman commented 2 years ago

Looking forward to your PR! And as we discussed before, the pre-committed state has to be cleaned up eventually, one way or another.