apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] support synchronize multiple table into corresponding paimon table for MySqlSyncTableAction #1248

Open zhangjun0x01 opened 1 year ago

zhangjun0x01 commented 1 year ago

Motivation

Currently, users can synchronize one or multiple MySQL tables into a single Paimon table. I think it is also necessary to support synchronizing multiple MySQL tables, each into its own corresponding Paimon table.

tsreaper commented 1 year ago

Hi @zhangjun0x01 , thanks for opening this issue.

However, I can't quite understand what you mean. What's the difference between "synchronize one or multiple tables from MySQL into one Paimon table" and "synchronize multiple MySQL tables to the corresponding paimon table"? To me they are the same.

zhangjun0x01 commented 1 year ago

Hi @tsreaper,

"synchronize one or multiple tables from MySQL into one Paimon table"

MySQL: db1.t1 + db2.t2 --> Paimon: db.t3

"synchronize multiple MySQL tables to the corresponding paimon table"

MySQL: db1.t1 + db2.t2 --> Paimon: db1.t1 + db2.t2

The first case: we can synchronize multiple MySQL tables into one Paimon table to build a wide table on Paimon.

The second case: there is no relationship between the MySQL tables, so I cannot merge them into one Paimon table. I want to use a single Flink job to synchronize all MySQL tables to their corresponding Paimon tables, instead of running one Flink job per table, in order to reduce resource consumption.
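
The difference between the two cases can be sketched as a routing decision: which Paimon table does each MySQL table land in? The snippet below is only a minimal Python illustration, not the actual MySqlSyncTableAction code; `TableId`, `route_merged`, and `route_one_to_one` are hypothetical names made up for this sketch.

```python
from typing import NamedTuple


class TableId(NamedTuple):
    """A (database, table) pair; hypothetical helper type for this sketch."""
    database: str
    table: str


def route_merged(source: TableId, target: TableId) -> TableId:
    """Case 1: every source table is written to the same target table."""
    return target


def route_one_to_one(source: TableId) -> TableId:
    """Case 2: each source table maps to a target with the same identifier."""
    return TableId(source.database, source.table)


sources = [TableId("db1", "t1"), TableId("db2", "t2")]

# Case 1: db1.t1 + db2.t2 --> db.t3
merged = {s: route_merged(s, TableId("db", "t3")) for s in sources}

# Case 2: db1.t1 + db2.t2 --> db1.t1 + db2.t2
one_to_one = {s: route_one_to_one(s) for s in sources}
```

In both cases a single job reads all the sources; only the number of distinct targets changes (one in the first case, one per source in the second).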

qidian99 commented 1 year ago

Yes, this feature is applicable in scenarios where the user has physically split a table into multiple tables (either vertically or horizontally) and wants to merge them into one Paimon table.

Some additional fields are needed to deal with primary key conflicts and data lineage.

gfunc commented 11 months ago

This also applies to CDC Kafka. IMO, it would be better to assign partition keys with values extracted from Kafka metadata (topic, etc.) or canal metadata (database, etc.) to avoid PK conflicts.
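
A rough Python sketch of that idea, assuming canal-json style messages that carry `database`, `table`, and `data` fields. The added column names (`src_topic`, `src_database`, `src_table`) and both helper functions are hypothetical and do not come from Paimon; the dict-based `upsert` only stands in for a primary-key table where the last write for a key wins.

```python
import json
from typing import Dict, List, Tuple


def with_source_partition(canal_message: str, topic: str) -> List[dict]:
    """Attach source metadata to every row of one canal-json style message.

    The extra columns are hypothetical names intended to serve as partition keys.
    """
    msg = json.loads(canal_message)
    return [
        {
            **row,
            "src_topic": topic,
            "src_database": msg["database"],
            "src_table": msg["table"],
        }
        for row in msg["data"]
    ]


def upsert(table: Dict[tuple, dict], rows: List[dict], key_fields: Tuple[str, ...]) -> None:
    """Toy upsert: the last row written for a given key wins."""
    for row in rows:
        table[tuple(row[f] for f in key_fields)] = row


# Two shards that both contain a row with id = 1.
m1 = json.dumps({"database": "db1", "table": "t1", "data": [{"id": 1, "v": "a"}]})
m2 = json.dumps({"database": "db2", "table": "t1", "data": [{"id": 1, "v": "b"}]})
rows = with_source_partition(m1, "topic_a") + with_source_partition(m2, "topic_a")

pk_only: Dict[tuple, dict] = {}
pk_with_partition: Dict[tuple, dict] = {}
upsert(pk_only, rows, ("id",))  # rows from the two shards collide
upsert(pk_with_partition, rows, ("src_database", "src_table", "id"))  # both rows kept
```

With `id` alone as the key, the row from `db2` overwrites the row from `db1`; once the source database and table are part of the key, both rows survive and each row also records where it came from.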