apache / seatunnel

SeaTunnel is a next-generation, high-performance, distributed tool for massive data integration.
https://seatunnel.apache.org/
Apache License 2.0

[Feature]Support CDC #2394

Closed: CalvinKirs closed this issue 1 year ago

CalvinKirs commented 2 years ago

Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database, then delivering those changes in real time to a downstream process or system.

CDC implementations mainly fall into two categories: query-based and binlog-based. We know that MySQL has the binlog (binary log) to record users' changes to the database, so one of the simplest and most efficient CDC implementations reads the binlog, and many open-source MySQL CDC implementations already work this way out of the box. The binlog is not the only way to implement CDC (at least for MySQL): even database triggers can provide similar functionality, but they fall short in efficiency and add load to the database.

Typically, after a CDC tool captures changes to a database, it publishes the change events to a message queue for downstream consumers. Debezium, for example, persists MySQL changes (it also supports PostgreSQL, MongoDB, and others) to Kafka; by subscribing to the events in Kafka, we can read the change content and implement whatever functionality we need.
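
For illustration, a minimal sketch of such a downstream consumer reading Debezium-formatted change events from Kafka. The broker address, topic name, and group id are placeholders, and the `payload` wrapper assumes Debezium's default JSON converter with schemas enabled:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DebeziumEventConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "cdc-demo");                // placeholder group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Debezium names topics <server-name>.<database>.<table> by default.
            consumer.subscribe(List.of("dbserver1.inventory.orders")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        continue; // tombstone record used for log compaction
                    }
                    JsonNode payload = mapper.readTree(record.value()).path("payload");
                    String op = payload.path("op").asText(); // c=create, u=update, d=delete, r=snapshot read
                    JsonNode before = payload.path("before"); // row image before the change (null on insert)
                    JsonNode after = payload.path("after");   // row image after the change (null on delete)
                    System.out.printf("op=%s before=%s after=%s%n", op, before, after);
                }
            }
        }
    }
}
```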

As a data synchronization tool, I think we need to support CDC as a feature, and I want to hear from you all how it could be implemented in SeaTunnel.

2013650523 commented 2 years ago

I'm going to use Debezium to get the incremental binlog in real time, pass it to the API of the new connector, and save the source state so it can be read back on restart.
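
A rough sketch of that idea using Debezium's embedded engine (the `DebeziumEngine` API), with a file-based offset store standing in for whatever state store the new connector API would provide. All connection settings are placeholders, and the property names follow Debezium 1.x naming (newer versions rename some of them, e.g. `database.server.name` became `topic.prefix`):

```java
import java.nio.file.Files;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class EmbeddedBinlogReader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("name", "seatunnel-cdc-demo");
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        props.setProperty("database.hostname", "localhost"); // placeholder
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "cdc_user");      // placeholder
        props.setProperty("database.password", "cdc_pass");  // placeholder
        props.setProperty("database.server.id", "5701");
        props.setProperty("database.server.name", "dbserver1");
        // Persist binlog offsets so the source can resume where it stopped
        // ("save the source state and read back").
        props.setProperty("offset.storage",
                "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename",
                Files.createTempFile("cdc-offsets", ".dat").toString());
        props.setProperty("offset.flush.interval.ms", "10000");
        // The MySQL connector also needs a schema history store (file-based here).
        props.setProperty("database.history",
                "io.debezium.relational.history.FileDatabaseHistory");
        props.setProperty("database.history.file.filename",
                Files.createTempFile("cdc-schema-history", ".dat").toString());

        try (DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                        .using(props)
                        .notifying(event -> {
                            // This is where events would be handed to the new connector API.
                            System.out.println(event.value());
                        })
                        .build()) {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine); // DebeziumEngine implements Runnable
            Thread.sleep(60_000);     // demo only: read for one minute, then close
            executor.shutdownNow();
        }
    }
}
```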

CalvinKirs commented 2 years ago

> I'm going to use Debezium to get the incremental binlog in real time, pass it to the API of the new connector, and save the source state so it can be read back on restart.

Any detailed designs can be put here, we'll discuss them first.

guanboo commented 2 years ago

We can implement this based on Flink CDC: configure it through the conf file and dynamically generate the DDL SQL for Source and Sink.

TyrantLucifer commented 2 years ago

What is the community's plan for achieving CDC in pure Java, without relying on any framework such as Flink?

CalvinKirs commented 2 years ago

> We can implement this based on Flink CDC: configure it through the conf file and dynamically generate the DDL SQL for Source and Sink.

Flink CDC is based on Flink, and our connectors need to be independent of the engine.

ashulin commented 2 years ago

We can implement it based on Debezium and Netflix's DBLog parallel algorithm.

melin commented 1 year ago

Implement CDC data synchronization to Hudi based on Debezium Server:

https://github.com/apache/hudi/issues/6853

@CalvinKirs

ashulin commented 1 year ago

> We can implement it based on Debezium and Netflix's DBLog parallel algorithm.

  1. Supports multiple tables and sharding (easy configuration)
  2. Supports parallel reading of historical data (fast synchronization, even for billion-row tables); see the sketch after this list
  3. Supports reading incremental data (CDC)
  4. Supports heartbeat detection (metrics, low-traffic tables)
  5. Supports dynamically adding new tables (easier operation and maintenance)
  6. Supports schema evolution (DDL)

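
For item 2, a compact sketch of the DBLog-style chunk snapshot (the watermark algorithm that also underlies Flink CDC's incremental snapshot). Every type here is a hypothetical placeholder, not a SeaTunnel API; the point is the low/high watermark handshake that lets chunks be read in parallel without table locks:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class DblogChunkReader {

    /** Hypothetical handle to the source database. */
    interface Db {
        void writeWatermark(String id); // insert into a watermark table; shows up in the change log
        List<Row> selectChunk(String table, Object lowKey, Object highKey); // plain SELECT, no locks
    }

    /** Hypothetical view of the change log (e.g. the binlog). */
    interface ChangeLog {
        List<Row> changesBetween(String lowWatermark, String highWatermark);
    }

    record Row(Object key, Object value) {}

    /** Snapshot one key-range chunk without locking the table. */
    List<Row> readChunk(Db db, ChangeLog log, String table, Object lowKey, Object highKey) {
        String low = UUID.randomUUID().toString();
        String high = UUID.randomUUID().toString();

        db.writeWatermark(low);                                      // 1. mark log position before the read
        List<Row> snapshot = db.selectChunk(table, lowKey, highKey); // 2. read the chunk
        db.writeWatermark(high);                                     // 3. mark log position after the read

        // 4. Reconcile: rows changed between the watermarks may be stale in the
        //    snapshot, so drop them here; the change stream will emit their
        //    latest versions anyway, which keeps the merged output correct.
        Map<Object, Row> merged = new LinkedHashMap<>();
        for (Row r : snapshot) {
            merged.put(r.key(), r);
        }
        for (Row r : log.changesBetween(low, high)) {
            merged.remove(r.key());
        }
        return List.copyOf(merged.values());
        // Different chunks are independent, so workers can run this in parallel,
        // which is what makes billion-row historical reads feasible.
    }
}
```
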
cason0126 commented 1 year ago

I have several opinions on the necessity and realization of CDC:

Necessity: For a new-generation data integration platform, CDC support is essential, because enterprise demand for change-stream capture keeps growing; the rapid rise of Flink CDC is a good example. If SeaTunnel does not strengthen its CDC capability, users who adopt it for batch processing will still need a separate CDC tool such as Canal, Debezium, or Flink CDC.

In addition, the change-stream capture underneath Flink CDC also uses Debezium.

Several implementation concerns: ---Engine > Connector. The databases with the most demand, such as MySQL and Oracle, have roughly similar CDC implementation schemes, all based on the database's change log. For example, Flink's Oracle CDC uses Debezium for change collection, and Debezium in turn relies on a LogMiner-based solution; StreamSets' Oracle handling is also based on LogMiner. The collection layer should therefore be a lower priority, and the design of the processing engine should be considered first. That design covers the concerns every CDC pipeline shares and that should be handled uniformly: consistency, resumable offsets (breakpoints), batch-stream unification, fault tolerance, and failover. Here Flink CDC is worth referencing.
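
To make the engine-first point concrete, a hypothetical sketch of the contract the engine could offer every CDC connector. None of these names are actual SeaTunnel APIs; they only illustrate where consistency, breakpoints, and failover would be handled uniformly:

```java
import java.io.Serializable;
import java.util.List;

/**
 * Hypothetical engine-side contract for CDC connectors. Not an actual
 * SeaTunnel interface; it only shows which cross-cutting concerns the
 * engine, rather than each connector, would own.
 */
public interface CdcSource<EVENT, OFFSET extends Serializable> {

    /** Start reading from a previously checkpointed offset, or from scratch if null. */
    void open(OFFSET restoredOffset) throws Exception;

    /** Pull the next batch of change events; must be replayable from the last offset. */
    List<EVENT> pollBatch() throws Exception;

    /** Offset to persist at a checkpoint barrier; restoring it gives resumable, consistent reads. */
    OFFSET snapshotOffset();

    /** Invoked once a checkpoint is durably stored, e.g. to acknowledge upstream. */
    void notifyCheckpointComplete(long checkpointId) throws Exception;

    /** Release connections; the engine calls this on failover before re-opening elsewhere. */
    void close() throws Exception;
}
```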

Since SeaTunnel 2.3.0, it has had its own computing engine, which is the cornerstone for processing CDC (most of the time, when the outlet of the CDC stream is a single thread, the processing does not need to be distributed, so it does not have to rely on computing engines such as Flink or Spark).

---Format compatibility. In most cases, once SeaTunnel has CDC processing capability, it will need to process messages that other CDC tools have already sent to Kafka, so compatibility with common formats, such as Flink CDC's, should also be considered. In other words, SeaTunnel must choose between designing its own format independently, which allows rapid development, and staying compatible with the formats already common in the market; this trade-off is worth considering.
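
As a concrete example of that format question, a sketch that normalizes a Debezium-style JSON envelope into a neutral internal record. The `RowChange` type and `fromDebezium` helper are hypothetical, and a Canal-style message (which uses `type`/`data`/`old` fields) would need its own mapping onto the same record:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ChangeEventNormalizer {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Neutral internal representation, independent of the producing tool. */
    public record RowChange(String kind, JsonNode before, JsonNode after) {}

    /** Maps a Debezium JSON envelope (op/before/after) onto the internal record. */
    public static RowChange fromDebezium(String json) throws Exception {
        JsonNode payload = MAPPER.readTree(json).path("payload");
        String kind = switch (payload.path("op").asText()) {
            case "c", "r" -> "INSERT"; // "r" marks rows read during the snapshot phase
            case "u" -> "UPDATE";
            case "d" -> "DELETE";
            default -> "UNKNOWN";
        };
        return new RowChange(kind, payload.path("before"), payload.path("after"));
    }

    // A fromCanal(...) counterpart would read Canal's "type" field and its
    // "data"/"old" row arrays instead; if every source format is mapped onto
    // RowChange, the rest of the pipeline never needs to know who produced it.
}
```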

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received a response for a long time. You can reopen it if you encounter similar problems in the future.