Unfortunately, this is not currently possible. In the future, changes are planned to support schema changes more seamlessly (e.g. JSON-based data encoding, with the Dataflow pipeline doing a live lookup of schemas).
I am working on a pipeline that does a live lookup of the schema. Is it possible to see the planned changes/approach, so I can learn from it or contribute?
I have not quite figured this out yet. A possibility is to encode the data in JSON format, use a single Pub/Sub topic, and periodically update the schemas in Data Catalog. On the merge side and the changelog-writing side, whenever we see a schema update, we update the BQ table.
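A minimal sketch, assuming the Data Catalog Java client, of what such a live lookup of a topic's schema could look like (the project and topic names are placeholders; this is not the project's current code):

```java
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.LookupEntryRequest;
import com.google.cloud.datacatalog.v1.Schema;

public class SchemaLookup {
  // Look up the Data Catalog entry for a Pub/Sub topic and return its schema.
  // The linked-resource format is the standard one for Pub/Sub topics;
  // project and topic names are hypothetical placeholders.
  public static Schema lookupTopicSchema(String project, String topic) throws Exception {
    try (DataCatalogClient client = DataCatalogClient.create()) {
      LookupEntryRequest request =
          LookupEntryRequest.newBuilder()
              .setLinkedResource(
                  String.format("//pubsub.googleapis.com/projects/%s/topics/%s", project, topic))
              .build();
      Entry entry = client.lookupEntry(request);
      return entry.getSchema();
    }
  }
}
```

In a Dataflow pipeline the result would presumably be cached and refreshed periodically rather than looked up per element.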
Can someone please explain to me how exactly Data Catalog fits into this, and how we are currently using it here? Also, can you please take a look at this similar issue where I've asked a couple of questions?
@asvseb @duizendstra @pabloem
@Ace-Bansal This project uses a custom Data Catalog entry in single-topic mode, and GCP does not support updating/editing custom entries, so we can't update the schema in that mode. When I tried deleting and re-creating the entry with the same name, the resource was not available again right away, since it takes some time for the resource to be freed up.
In multi-topic mode, this project uses the Pub/Sub entry, which can be updated, and GCP provides an API for those operations. So, to my knowledge, Data Catalog is a fit for handling schema changes only in multi-topic mode.
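As a rough sketch (again, not the project's code), updating the schema on a Pub/Sub-backed entry in multi-topic mode could look like the following; the new column is hypothetical, and the entry is assumed to have been fetched beforehand:

```java
import com.google.cloud.datacatalog.v1.ColumnSchema;
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.Schema;
import com.google.protobuf.FieldMask;

public class EntrySchemaUpdate {
  // Overwrite the schema of an existing (Pub/Sub-backed) Data Catalog entry.
  // "entry" is assumed to have been obtained via lookupEntry.
  public static Entry updateEntrySchema(DataCatalogClient client, Entry entry) {
    Schema newSchema =
        Schema.newBuilder()
            .addColumns(
                ColumnSchema.newBuilder()
                    .setColumn("new_column") // hypothetical newly added column
                    .setType("STRING")
                    .setMode("NULLABLE")
                    .build())
            .build();
    Entry updated = entry.toBuilder().setSchema(newSchema).build();
    // Restrict the update to the schema field; everything else on the entry is left untouched.
    return client.updateEntry(updated, FieldMask.newBuilder().addPaths("schema").build());
  }
}
```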
In the Debezium-with-Kafka-Connect solution, Confluent uses a Schema Registry that handles schema changes (registering new schemas, etc.). Moreover, with the Avro serialization format we can combine schema and data, so we could avoid Data Catalog altogether if we implemented that feature in this project.
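For reference, a hedged sketch of the Kafka Connect converter settings that delegate schema handling to Confluent Schema Registry; the registry URL is a placeholder, and in a real deployment these usually live in the worker/connector properties file rather than Java code:

```java
import java.util.Properties;

public class AvroConnectConfig {
  // Converter settings that make Kafka Connect serialize records as Avro and
  // register/evolve schemas through Confluent Schema Registry.
  public static Properties avroConverterProps() {
    Properties props = new Properties();
    props.put("key.converter", "io.confluent.connect.avro.AvroConverter");
    props.put("value.converter", "io.confluent.connect.avro.AvroConverter");
    props.put("key.converter.schema.registry.url", "http://schema-registry:8081");
    props.put("value.converter.schema.registry.url", "http://schema-registry:8081");
    return props;
  }
}
```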
Other than that, we may need to identify a way to compare the old and new schemas, or to get the newly added/updated columns and data types, so that we can update the BQ table schema; a sketch of that is below.
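To make that concrete, a minimal sketch assuming the BigQuery Java client, which diffs an incoming schema against the table's current schema and appends the missing columns (only additive changes are handled; type changes would need a different strategy):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class BigQuerySchemaEvolver {
  // Compare the incoming (new) schema against the destination table's schema
  // and append any columns the table does not have yet.
  public static void addNewColumns(TableId tableId, Schema incoming) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Table table = bigquery.getTable(tableId);
    Schema current = table.getDefinition().getSchema();

    List<String> existingNames =
        current.getFields().stream().map(Field::getName).collect(Collectors.toList());

    List<Field> merged = new ArrayList<>(current.getFields());
    for (Field field : incoming.getFields()) {
      if (!existingNames.contains(field.getName())) {
        // New columns must be NULLABLE (or REPEATED) to be added to an existing table.
        merged.add(field.toBuilder().setMode(Field.Mode.NULLABLE).build());
      }
    }

    if (merged.size() > current.getFields().size()) {
      table.toBuilder()
          .setDefinition(StandardTableDefinition.of(Schema.of(merged)))
          .build()
          .update();
    }
  }
}
```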
Please let me know your suggestions, as I would also like to contribute to this project. @pabloem
Hi, schema changes are possible with the approach pabloem mentioned on Apr 24, using JSON encoding.
Hi, in multi-topic mode it is possible (as it uses the Pub/Sub-entry-based catalog), but in single-topic mode it is not, because this solution uses a custom catalog entry and Google has not implemented an update API for custom entries. Even if the data is encoded in JSON, the schema can't be updated in the custom catalog entry.
@asvseb @pabloem
Hi, I am also facing the issue of schema changes not being supported. I am using the Debezium embedded connector for MySQL to stream data into BigQuery with a single topic. Whenever there is a schema change, the connector has to be redeployed. Please let me know if anyone has found a solution to this issue or has any ideas for avoiding having to redeploy the connector. Converting to JSON is not feasible. Thanks!
@pabloem @gkstechie
Unfortunately, I don't have time to spend working on this. It's a difficult problem to solve with Dataflow, as coders need to be updated, and BQ schemas would also need to be updated on the fly.
This issue has been marked as stale due to 180 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the issue at any time. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
How can schema changes be handled for the MySQL-to-BQ CDC? Any ideas on how to do that seamlessly, without redeploying the pipeline?