In the sample Doris application, data flow is as follows:
read streaming data from Kafka
Execute ETL in Flink
Sink data batch to Doris by stream load
Flink generates checkpoints on a regular, configurable interval and then writes the checkpoint to a persistent storage system, such as HDFS. A checkpoint in Flink is a consistent snapshot of:
The current state of an application
The consumption progress of data stream(offset)
In the event of a machine or Flink software failure and upon restart, the Flink application resumes processing from the most recent successfully-completed checkpoint, which causes partial data to be loaded to Doris twice and duplicate data.
To provide exactly-once semantics, Doris must provide a means to commit or rollback load that coordinate with Flink's checkpoints. So, it's better to support Two-Phase Commit(2PC) for stream load.
For the data sink to provide exactly-once guarantees, it must:
write all data to Doris through several stream load tasks between two checkpoints (All data is non-visible).
commit all stream load tasks between two checkpoints(All data is visible).
In the event of a machine or Flink software failure and upon restart, commit all stream load tasks between the most recent two checkpoints(It is ok to execute commit repeatedly for a stream load task).
The design of the two phase for stream load is as follows:
First Phase:
Second Phase:
Once the pre-commit is complete, we must ensure that the commit can be successful. Of course, Data releated to expired transaction which has been pre-committed could be removed.
Background
In the sample Doris application, data flow is as follows:
stream load
Flink generates checkpoints on a regular, configurable interval and then writes the checkpoint to a persistent storage system, such as HDFS. A checkpoint in Flink is a consistent snapshot of:
offset
)In the event of a machine or Flink software failure and upon restart, the Flink application resumes processing from the most recent successfully-completed checkpoint, which causes partial data to be loaded to Doris twice and duplicate data.
To provide exactly-once semantics, Doris must provide a means to commit or rollback load that coordinate with Flink's checkpoints. So, it's better to support
Two-Phase Commit(2PC)
for stream load.For the data sink to provide exactly-once guarantees, it must:
In the event of a machine or Flink software failure and upon restart, commit all stream load tasks between the most recent two checkpoints(It is ok to execute commit repeatedly for a stream load task).
Reference: https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
Design
The design of the two phase for stream load is as follows:
Once the
pre-commit
is complete, we must ensure that thecommit
can be successful. Of course, Data releated to expired transaction which has been pre-committed could be removed.Are you willing to submit PR?
Code of Conduct