[Feature][CDCSOURCE] source with kafka debezium json format

ysmintor commented 8 months ago

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

Description

Our existing multi-source databases contain several thousand tables in total. The Kafka cluster we integrate with collects incremental data via CDC, and most of that data is in debezium json format. Because there are so many tables, a single Kafka topic carries a varying number of tables. We do not have permission to connect directly to the thousands of business databases, and they are not MySQL anyway; the CDC sources that Dinky officially provides are MySQLCDC, OracleCDC, and so on.

We now need to implement full-database synchronization by consuming from Kafka, where one topic contains multiple tables. We hope that a Kafka source with debezium json format can be added as a data source.
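To illustrate the situation, here is a minimal sketch of how records in one topic could be grouped by table. It assumes the standard Debezium envelope, where each message value carries a `source` block with the table name (and usually `db`), optionally wrapped in `schema`/`payload` when schemas are enabled. The function names `table_of` and `group_by_table` are hypothetical and not part of Dinky or Flink.

```python
import json
from collections import defaultdict


def table_of(raw_value):
    """Return (database, table) for one Debezium JSON message value."""
    event = json.loads(raw_value)
    # With schemas enabled, the envelope sits under "payload".
    if "schema" in event and "payload" in event:
        event = event["payload"]
    source = event["source"]
    # "db" may be absent for some connectors, so read it defensively.
    return source.get("db"), source["table"]


def group_by_table(raw_values):
    """Group the raw message values of one topic by their source table."""
    groups = defaultdict(list)
    for raw in raw_values:
        groups[table_of(raw)].append(raw)
    return dict(groups)
```

In a real job the grouping key would drive which sink each record is written to; this sketch only shows that the table identity is recoverable from the message value itself.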

Use case

No response

Related issues

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

github-actions[bot] commented 8 months ago

Hello @ysmintor, this issue is about CDC/CDCSOURCE, so I assign it to @aiwenmo. If you have any questions, you can comment and reply.


Zzm0809 commented 8 months ago

You can simply use the Kafka connector directly; the data is all JSON anyway.

aiwenmo commented 8 months ago

Is your requirement to split the data and write it to different tables?

ysmintor commented 8 months ago


@aiwenmo Yes. One Kafka topic may contain CDC records for multiple tables, and they need to be written to different tables. I also think we should cover the case where multiple Kafka topics correspond to one table.
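Both cases can be sketched with a small routing step in front of the write. The example below is only an illustration under stated assumptions: it uses the standard Debezium `op` codes (`c`, `r`, `u`, `d`), keeps target tables as in-memory dictionaries, and takes the primary key columns as given. The names `topic_to_table`, `pk_by_table`, `route`, `apply_event`, and `sync` are hypothetical.

```python
import json


def unwrap(raw_value):
    """Parse a message value and strip the optional schema/payload wrapper."""
    event = json.loads(raw_value)
    if "schema" in event and "payload" in event:
        return event["payload"]
    return event


def route(topic, envelope, topic_to_table):
    """Many topics -> one table if the topic is mapped; else use source.table."""
    fixed = topic_to_table.get(topic)
    return fixed if fixed is not None else envelope["source"]["table"]


def apply_event(tables, target, pk_columns, envelope):
    """Apply one change event to the in-memory target table."""
    rows = tables.setdefault(target, {})
    op = envelope["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: upsert "after"
        row = envelope["after"]
        rows[tuple(row[c] for c in pk_columns)] = row
    elif op == "d":  # delete: remove the row identified by "before"
        row = envelope["before"]
        rows.pop(tuple(row[c] for c in pk_columns), None)


def sync(records, topic_to_table, pk_by_table):
    """Consume (topic, raw_value) pairs and return the resulting tables."""
    tables = {}
    for topic, raw_value in records:
        if raw_value is None:  # skip tombstone records that follow deletes
            continue
        envelope = unwrap(raw_value)
        target = route(topic, envelope, topic_to_table)
        apply_event(tables, target, pk_by_table[target], envelope)
    return tables
```

A topic listed in `topic_to_table` is always written to its mapped table, which covers several topics feeding one table; any other topic is split by `source.table`, which covers one topic carrying several tables.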

ysmintor commented 7 months ago

@aiwenmo @Zzm0809

I have actually taken a look at the Flink CDC and Hudi solutions. But it seems a bit hard for my team to implement a connector from Kafka CDC (which I sometimes call debezium json in Kafka) to Hudi or other databases.

Recently I spent some time practicing with Apache Paimon's CDC ingestion from Kafka, and I think it might be a solution for us, especially as Apache Paimon graduated from incubation several days ago and became an Apache Top-Level Project. So I wonder whether you could implement this Kafka CDC source connector, adopt their implementation of KafkaSyncDatabaseAction and KafkaSyncTableAction, or just wrap it into a CDCSOURCE task on Dinky.

I know these features may require some code and structural changes, so please consider them as your schedule allows.

Zzm0809 commented 7 months ago


Do you have the energy to fulfill this requirement?

ysmintor commented 7 months ago


Sorry, I do not have resources to implement this feature.

aiwenmo commented 7 months ago

I am willing to submit a PR.

medivh511 commented 6 months ago


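You only need to use Dinky's Flink jar approach to invoke the Paimon action for Kafka CDC; there is no need to build a separate source. The current problem, however, is that Paimon 0.8 still has not solved automatic primary key discovery for debezium json. The debezium json format carries no primary key information, so the primary key cannot be identified and Paimon cannot create the tables automatically. Judging from Paimon's issues, someone has implemented taking the primary key from the key generated by Kafka Connect, but it has not been merged into the master branch yet, so we will have to wait for version 0.9. In addition, the messages that Dinky's datastream-kafka currently writes to Kafka contain only a value and no key, so they would not be recognized by Paimon CDC either.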
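As a rough illustration of the key-based approach, the sketch below derives the primary key column names from a Kafka record key. It assumes the usual Debezium behavior of writing the table's key columns into the record key as JSON, optionally wrapped in `schema`/`payload`. The function `primary_key_columns` is hypothetical and is not Paimon's implementation.

```python
import json


def primary_key_columns(raw_key):
    """Return the primary key column names found in a Kafka record key.

    Returns an empty list when the record has no key, which is the
    value-only case described above.
    """
    if not raw_key:
        return []
    key = json.loads(raw_key)
    # With schemas enabled, the key columns sit under "payload".
    if isinstance(key, dict) and "schema" in key and "payload" in key:
        key = key["payload"]
    return list(key.keys()) if isinstance(key, dict) else []
```

With a value-only message the function returns an empty list, so a consumer relying on the key has nothing to infer the primary key from.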

github-actions[bot] commented 5 months ago

Hello @, this issue has not been active for more than 30 days. This issue will be closed in 7 days if there is no response. If you have any questions, you can comment and reply.

