GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

How to achieve schema changes for the MySQL to BQ CDC? Any ideas to do that seamlessly without redeploying it? #97

Closed: asvseb closed this issue 4 months ago

asvseb commented 4 years ago

How can we achieve schema changes for the MySQL to BQ CDC? Any ideas on how to do that seamlessly without redeploying the pipeline?

pabloem commented 4 years ago

Unfortunately, this is not currently possible. Changes are planned to support schema changes more seamlessly in the future (e.g. JSON-based data encoding, with the Dataflow pipeline looking up schemas live).

duizendstra commented 4 years ago

I am working on a pipeline that does a live lookup of the schema. Is it possible to see the planned changes/approach, so I can learn from it or contribute?

pabloem commented 4 years ago

I have not quite figured this out yet. One possibility is to encode the data in JSON format, use a single Pub/Sub topic, and periodically update the schemas in Data Catalog. On the merge side and the changelog-writing side, whenever we see a schema update, we update the BQ table.
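
To make the live-lookup part concrete, here is a rough, untested sketch of how a pipeline could periodically re-read the schema from the Data Catalog entry of a Pub/Sub topic, assuming the latest schema is kept on that entry (the project and topic names below are made up, not what the template uses):

```java
import com.google.cloud.datacatalog.v1.ColumnSchema;
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.LookupEntryRequest;

public class SchemaLookup {
  public static void main(String[] args) throws Exception {
    // Hypothetical topic; in a real pipeline this would come from the pipeline options.
    String linkedResource =
        "//pubsub.googleapis.com/projects/my-project/topics/my-changelog-topic";
    try (DataCatalogClient client = DataCatalogClient.create()) {
      // Look up the catalog entry that Data Catalog keeps for the Pub/Sub topic.
      Entry entry = client.lookupEntry(
          LookupEntryRequest.newBuilder().setLinkedResource(linkedResource).build());
      // The entry's schema is what the pipeline would compare against its cached copy.
      for (ColumnSchema column : entry.getSchema().getColumnsList()) {
        System.out.printf("%s : %s%n", column.getColumn(), column.getType());
      }
    }
  }
}
```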

ekanban commented 4 years ago

Can someone please explain how exactly Data Catalog fits into this and how we are currently using it here? Also, could you please take a look at this similar issue where I've asked a couple of questions?

@asvseb @duizendstra @pabloem

gkstechie commented 4 years ago

@Ace-Bansal In single-topic mode this project uses a custom Data Catalog entry, and GCP does not support updating/editing such entries, so we can't update the schema in that mode. When I tried deleting and recreating the entry, a resource with the same name was not available, because it takes some time for the name to be freed up.

In multi-topic mode, this project uses Pub/Sub-backed entries, which can be updated, and GCP provides an API for those operations. So, as far as I know, Data Catalog is only a fit for handling schema changes in multi-topic mode.
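
For reference, a minimal, untested sketch of what such an update on a Pub/Sub-backed entry could look like with the Data Catalog v1 client (the topic and column names are placeholders, and I haven't verified this against the template itself):

```java
import com.google.cloud.datacatalog.v1.ColumnSchema;
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.LookupEntryRequest;
import com.google.cloud.datacatalog.v1.Schema;
import com.google.protobuf.FieldMask;

public class EntrySchemaUpdate {
  public static void main(String[] args) throws Exception {
    // Hypothetical per-table topic used in multi-topic mode.
    String linkedResource =
        "//pubsub.googleapis.com/projects/my-project/topics/my-table-topic";
    try (DataCatalogClient client = DataCatalogClient.create()) {
      Entry entry = client.lookupEntry(
          LookupEntryRequest.newBuilder().setLinkedResource(linkedResource).build());
      // Append the newly discovered column to the existing schema.
      Schema updated = entry.getSchema().toBuilder()
          .addColumns(ColumnSchema.newBuilder().setColumn("new_column").setType("STRING"))
          .build();
      // Only the schema field is updated; everything else on the entry stays as-is.
      client.updateEntry(
          entry.toBuilder().setSchema(updated).build(),
          FieldMask.newBuilder().addPaths("schema").build());
    }
  }
}
```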

In the Debezium-with-Kafka-Connect solution, Confluent uses Schema Registry, which handles schema changes (registering new schemas, etc.). Moreover, with the Avro serialization format we can combine schema and data, so we could avoid Data Catalog entirely if we implemented that in this project.

Other than that, we need to identify ways to compare the old and new schemas, or to get the newly added columns and changed data types, so that we can update the BQ table schema (see the sketch below).
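
Something along these lines, as a hedged sketch with the BigQuery Java client; the table reference and the source of `incomingSchema` are placeholders, not what the template does today:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BigQuerySchemaPatcher {

  /** Adds any columns present in the incoming schema but missing from the BQ table. */
  public static void addMissingColumns(TableId tableId, Schema incomingSchema) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Table table = bigquery.getTable(tableId);
    Schema existing = table.getDefinition().getSchema();

    Set<String> existingNames = new HashSet<>();
    List<Field> merged = new ArrayList<>(existing.getFields());
    for (Field field : existing.getFields()) {
      existingNames.add(field.getName());
    }
    // Only additive changes: new columns are appended, existing columns are left untouched.
    for (Field field : incomingSchema.getFields()) {
      if (!existingNames.contains(field.getName())) {
        merged.add(field);
      }
    }
    if (merged.size() > existing.getFields().size()) {
      bigquery.update(
          table.toBuilder()
              .setDefinition(StandardTableDefinition.of(Schema.of(merged)))
              .build());
    }
  }
}
```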

Please let me know your suggestions, as I would also like to contribute to this project. @pabloem

asvseb commented 4 years ago

Hi, the schema changes are possible with the JSON-encoding approach pabloem mentioned on Apr 24.

gkstechie commented 4 years ago

Hi, in multi-topic mode it is possible (since it uses a Pub/Sub-entry-based catalog), but in single-topic mode it is not, because this solution uses a custom catalog entry and Google has not implemented an update API for custom entries. Even if the data is encoded in JSON, the schema can't be updated in the custom catalog entry.

@asvseb @pabloem

krattan406 commented 3 years ago

Hi, I am also facing the issue of schema changes not being supported. I am using the Debezium embedded connector for MySQL to stream data into BigQuery with a single topic. Whenever there is a schema change, the connector has to be redeployed. Please let me know if anyone has found a solution to this issue or has any ideas for avoiding having to redeploy the connector. Converting to JSON is not feasible. Thanks!

@pabloem @gkstechie

pabloem commented 3 years ago

Unfortunately I don't have time to spend working on this. It's a difficult problem to solve with Dataflow, as coders need to be updated, and BQ schemas would also need to be updated on the fly.

github-actions[bot] commented 5 months ago

This issue has been marked as stale due to 180 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the issue at any time. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.