Open JesseAtSZ opened 2 years ago
I am also new to seatunnel and flink; I think this feature could be interesting and useful.
The current seatunnel-connector-flink-jdbc is only used for offline batch jobs; it does not support capturing database changes in real time. flink-cdc is good at capturing data changes, so adding flink-cdc to seatunnel would let seatunnel handle real-time capture jobs.
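For readers new to CDC, here is a simplified picture of what flink-cdc (via Debezium) captures: a change event carrying the before/after row images and an operation type. This is an illustrative sketch in Python; the field values and table are hypothetical.

```python
# A simplified, illustrative Debezium-style change event as a plain dict.
# Field names follow the Debezium envelope (before/after/op/source);
# the concrete values here are hypothetical.
event = {
    "before": None,                        # row image before the change (None for inserts)
    "after": {"id": 1, "name": "alice"},   # row image after the change
    "op": "c",                             # c=create, u=update, d=delete, r=snapshot read
    "source": {
        "db": "test_db",
        "table": "users",
        "ts_ms": 1650000000000,
    },
}

def is_insert(e):
    """An insert carries op 'c' and no 'before' image."""
    return e["op"] == "c" and e["before"] is None

print(is_insert(event))
```

A downstream sink can decide what to do with each event purely from `op` and the row images, which is why the source side does not need to care how the change events are used.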
And to answer your concern: flink-cdc also focuses on capturing data change events; it does not care how the change events are used. This is a good beginning; let's wait for other community members to give more suggestions.
@ruanwenjun Thank you very much for your advice, but I still have the following questions:
In my view, CDC is just a special kind of source whose data describes changes, and users can do anything with that data. What we should do is create a dedicated CDC series of connectors (both source and sink), so users can transfer data changes using seatunnel.
@CalvinKirs Can you share some information about how to support flink cdc?
ping @CalvinKirs
@JesseAtSZ Any progress? If you'd like, I'd like to contribute this feature together with you.
@BenJFan I haven't started writing code recently; I have mainly been trying to understand how to ensure strict transactional consistency when using Flink CDC: can Flink CDC guarantee MySQL transactions? In addition, at the code level I still have the three problems above to solve.
Strict transactional consistency is not something a single component can guarantee. The transactions that CDC can guarantee require not only that the data source be replayable (the binlog can be replayed), but also that the sink side support transactions (traditional or distributed) or write idempotency.
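As a concrete picture of write idempotency, here is a minimal Python sketch with an in-memory dict standing in for a sink table keyed by primary key; replaying the same change event after a failure is a no-op:

```python
# In-memory stand-in for a sink table keyed by primary key.
table = {}

def upsert(table, row):
    """Write by primary key: applying the same row twice leaves the
    table in the same state, which is what makes binlog replay safe."""
    table[row["id"]] = row

change = {"id": 1, "name": "alice"}
upsert(table, change)
upsert(table, change)  # replay of the same event after a failure

print(len(table))  # still one row
```

This is the property a real sink gets from `REPLACE INTO` / `INSERT ... ON DUPLICATE KEY UPDATE` style writes.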
Maybe we can support CDC first, then add exactly-once support.
Flink CDC supports binlog replay. The problem I want to solve is having the sink side strictly preserve the source side's transactions, rather than simply inserting and modifying rows one by one through SQL (that only replays SQL; it cannot guarantee transactions. For example, if a sink-side transaction suddenly goes down mid-execution, the sink-side data is left in an inconsistent state). I think there are several key points to this problem:
- The source side can obtain transaction information and preserve its order.
- The sink side can apply transactions in order and remain idempotent during fault recovery.
I have some thoughts on these two questions:
- I found that the changelog events in Debezium contain transaction information, but the transaction information in Flink's SourceRecord is incomplete. I am considering whether to improve Flink CDC's transaction information, then build separate queues per transaction id, and finally commit in GTID order.
- Idempotency during fault recovery mainly means ensuring a transaction is never executed twice, so it may be necessary to introduce checkpoints to record the transaction id; I haven't thought this through yet.
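The per-transaction queue idea above (buffer events by transaction id, then commit whole transactions in GTID order) could look roughly like this in pure Python. The event shapes and integer GTIDs are simplifications for illustration; real MySQL GTIDs are `server_uuid:txn_no` strings.

```python
from collections import defaultdict

# Hypothetical change events, interleaved across two transactions.
events = [
    {"gtid": 2, "row": "b1"},
    {"gtid": 1, "row": "a1"},
    {"gtid": 2, "row": "b2"},
    {"gtid": 1, "row": "a2"},
]

# Buffer events into one queue per transaction id.
queues = defaultdict(list)
for e in events:
    queues[e["gtid"]].append(e["row"])

# Commit each buffered transaction as a unit, ordered by GTID.
committed = []
for gtid in sorted(queues):
    committed.append((gtid, queues[gtid]))

print(committed)  # [(1, ['a1', 'a2']), (2, ['b1', 'b2'])]
```

The point of the buffering step is that the sink never applies a partial transaction: each queue is flushed atomically, in source commit order.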
- The order of transactions is determined by the transaction id. Idempotency needs to be supported by the design of the data writing method, and has nothing to do with fault recovery.
- CDC should already support checkpoints.
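For what it's worth, a minimal sketch of the checkpoint idea: record the last applied transaction id so that binlog replay after recovery skips already-applied transactions. Integer transaction ids and the in-memory checkpoint are assumptions for illustration; a real implementation would restore the checkpoint from durable state.

```python
# Illustrative sketch: skip transactions at or below the checkpointed id
# when the binlog is replayed after a failure.
checkpoint = {"last_applied_txn": 0}
applied = []

def apply_txn(txn_id, rows):
    if txn_id <= checkpoint["last_applied_txn"]:
        return  # already applied before the crash; replay is a no-op
    applied.append((txn_id, rows))
    checkpoint["last_applied_txn"] = txn_id

apply_txn(1, ["a"])
apply_txn(2, ["b"])
# Simulate recovery: the source replays from an earlier binlog position.
apply_txn(1, ["a"])
apply_txn(2, ["b"])
apply_txn(3, ["c"])

print(applied)  # [(1, ['a']), (2, ['b']), (3, ['c'])]
```

This is exactly-once at transaction granularity, provided the checkpoint and the writes are persisted atomically.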
The combination of Flink CDC and Flink JDBC already achieves idempotency: there are checkpoints on the source side and upsert on the sink side. However, this combination only provides eventual consistency, not real-time consistency (as I said above, Flink CDC and Flink JDBC split the operations in a transaction into many SQL statements). The transaction order and checkpoints I mentioned here refer to an implementation that preserves transactions.
If we only want eventual consistency, I don't think data synchronization is hard to implement. But if strict consistency is required, there will be transaction problems, which depend on the transaction information provided by Flink CDC; at present that information is incomplete. I'm not sure what level we want to achieve; maybe eventual consistency is enough.
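A toy simulation of the split-transaction point (pure Python, invented account data): applying a source transaction row by row and crashing midway leaves the sink in a state that never existed on the source, which is why upsert alone gives only eventual consistency.

```python
# A source transaction that moves 100 from account a to account b.
txn = [("a", -100), ("b", +100)]

sink = {"a": 500, "b": 500}

def apply_row_by_row(sink, txn, crash_after=None):
    """Apply each change as its own statement, as a row-level upsert
    sink does. A crash partway through exposes an intermediate state
    that never existed on the source."""
    for i, (acct, delta) in enumerate(txn):
        sink[acct] += delta
        if crash_after is not None and i == crash_after:
            return  # simulated failure mid-transaction

apply_row_by_row(sink, txn, crash_after=0)
# The sink now shows a=400, b=500: total 900 instead of 1000.
print(sink)
```

Replaying the binlog later repairs this (eventual consistency), but any reader querying the sink before the replay observes the inconsistent intermediate state.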
In my opinion, step by step. The first thing is to support CDC, then consider consistency.
@BenJFan I still have these questions that I hope can be answered: https://github.com/apache/incubator-seatunnel/issues/1461#issuecomment-1064156814
Hi, we have a similar question. If seatunnel supports flink cdc, there should be a flink cdc source plugin; is it on the roadmap?
Search before asking
Description
Having followed the Seatunnel project for a long time, I am very interested in it and have downloaded and tried it. Unfortunately, due to my unfamiliarity with Flink and Java, my attempt to implement Flink CDC for this project ran into some problems. After spending more effort studying, I have come up with a few ideas on implementing CDC. The following are some of my ideas that I would like to discuss with you, and I hope you can give some suggestions:
Seatunnel currently supports two plug-in systems: Flink and Spark. However, there is still room for improvement in the existing plug-in systems' support for CDC. I suggest adding a CDC system for the following reasons:
So I think the new CDC system consists of two phases: Source and Sink. The Source outputs a DataStream; the Sink receives the DataStream, parses the fields, and modifies the target database through Flink JDBC.
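The Sink phase described here (parse the event fields, emit statements to the target database) could be sketched as an event-to-SQL mapping. The event shape is hypothetical, and a real implementation would use parameterized JDBC statements rather than string formatting.

```python
def event_to_sql(event):
    """Map a Debezium-style change event to a MySQL statement.
    The event shape is hypothetical; real code must use parameterized
    statements to avoid SQL injection."""
    table = event["source"]["table"]
    if event["op"] in ("c", "r"):      # insert / snapshot read
        cols = ", ".join(event["after"])
        vals = ", ".join(repr(v) for v in event["after"].values())
        return f"REPLACE INTO {table} ({cols}) VALUES ({vals})"
    if event["op"] == "u":             # update
        sets = ", ".join(f"{k} = {v!r}" for k, v in event["after"].items())
        return f"UPDATE {table} SET {sets} WHERE id = {event['before']['id']!r}"
    if event["op"] == "d":             # delete
        return f"DELETE FROM {table} WHERE id = {event['before']['id']!r}"
    raise ValueError(f"unknown op: {event['op']}")

sql = event_to_sql({
    "op": "c", "before": None,
    "after": {"id": 1, "name": "alice"},
    "source": {"table": "users"},
})
print(sql)  # REPLACE INTO users (id, name) VALUES (1, 'alice')
```

Using `REPLACE INTO` (rather than plain `INSERT`) for the insert/snapshot cases is one way to keep the writes idempotent under replay.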
Taking MySQL synchronization to MySQL as an example, the key points of data processing are as follows:
Remaining problems:
The above is my rough design for seatunnel CDC support. I'm really not familiar with Flink (I've only studied it for about a week), so I hope you can give me more suggestions. Thank you very much!
Usage Scenario
No response
Related issues
No response
Are you willing to submit a PR?
Code of Conduct