apache / seatunnel

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
https://seatunnel.apache.org/
Apache License 2.0

Mysql cdc duplicate synced data #6725

Open HSLife1991 opened 5 months ago

HSLife1991 commented 5 months ago

I am syncing MySQL data to Hive with the MySQL CDC connector.
1. The initial sync copied all the data, and the Hive table was correct.
2. I then changed some rows in the source MySQL table and deleted some records.
3. The destination Hive table now contains duplicate records for the existing MySQL rows that I changed.

liunaijie commented 5 months ago

With the default format, MySQL CDC generates two records when an upstream row is updated: one delete record and one insert record. That may be why your data is duplicated. If you change the format to compatible_debezium_json, it generates only one update record. You can switch the sink to Console to check the result.

In your case the destination is Hive, which does not support update or delete operations. CDC will also generate a lot of small files, so this setup may not be a good idea.
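
To illustrate the point above, here is a minimal Python sketch of why a delete + insert pair leaves a duplicate in a sink that can only append. It is a toy model, not SeaTunnel or Hive code: `update_as_changelog`, `append_only_sink`, and `keyed_sink` are hypothetical names, and the assumption is that an append-only destination simply cannot apply the delete half of the pair.

```python
# Toy model only: every name here is hypothetical, none is a SeaTunnel or Hive API.

def update_as_changelog(before, after):
    """Model one upstream UPDATE as a delete record followed by an insert record."""
    return [("delete", before), ("insert", after)]


def append_only_sink(table, events):
    """A sink that can only append rows; delete records cannot be applied and are dropped."""
    for op, row in events:
        if op == "insert":
            table.append(row)
    return table


def keyed_sink(table, events, key="id"):
    """A sink that can delete and upsert rows by primary key."""
    for op, row in events:
        if op == "delete":
            table.pop(row[key], None)
        else:
            table[row[key]] = row
    return table


old = {"id": 1, "name": "alice"}
new = {"id": 1, "name": "alicia"}
events = update_as_changelog(old, new)

# Both sinks start from the state left by the initial full sync.
appended = append_only_sink([old], events)
keyed = keyed_sink({1: old}, events)

print(len(appended))  # 2: the old row and the new row both remain
print(len(keyed))     # 1: the old row was replaced
```

In the append-only case the row with `id` 1 appears twice after a single update, which matches the duplicates reported in this issue. A destination that can apply deletes or upserts by key ends up with one row.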

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.