GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0

cdc-embedded-connector doesn't push CDC changes on Pub/Sub #175

Closed karan-kaushik-searce closed 3 months ago

karan-kaushik-searce commented 3 years ago

We are deploying the solution below to sink data from MySQL to BigQuery using CDC: cdc-embedded-connector

Connector properties:
databaseName=testdb
databaseUsername=root
databaseAddress=localhost
databasePort=3306
gcpProject=GCP_project_name
databasePassword=password
whitelistedTables=instance-name.testdb.testtab
singleTopicMode=true
gcpPubsubTopicPrefix=debeziumTest
databaseManagementSystem=mysql

The topic is already created in Pub/Sub with the name "debeziumTest".

When I run the Maven command below, it starts without any error, but no CDC changes are pushed to Pub/Sub. Also, no log file is written.

sudo mvn exec:java -pl cdc-embedded-connector -Dexec.args="/path/to/properties-file"

debezium connector issue

Spince commented 3 years ago

@karan-kaushik-searce do you see an entry group created in data catalog?

karan-kaushik-searce commented 3 years ago

@Spince - Yes it creates an entry-group in the data catalog.

Spince commented 3 years ago

@karan-kaushik-searce I received failure messages until I had the right permissions for each of the GCP services used. Are the tables you're monitoring empty? Have you started the connector and inserted/updated/deleted any rows?

karan-kaushik-searce commented 3 years ago

@Spince - The VM has been granted permissions for the GCP services. The table is not empty. We start the connector and insert records; the CDC changes are captured by the connector, but it fails to push the messages to Pub/Sub. We observed this by enabling debug logs.

I will enable trace logging and share the logs with you.

shanemeister commented 3 years ago

Any resolution on this?

chetansharmagithub commented 3 years ago

@karan-kaushik-searce You've used whitelistedTables=instance-name.testdb.testtab. Are you sure that your MySQL instance name is 'instance-name'?

karunakarv commented 3 years ago

How can I get the MySQL instance name? I have installed MySQL on a Linux VM. Will it be the same as my Linux host name, or is there a command to find it?

chetansharmagithub commented 3 years ago

@karunakarv I don't know yet either. I've tried some random instance names like mysql, localhost, host-server-name, etc., but none of them worked.

chetansharmagithub commented 3 years ago

@karan-kaushik-searce @karunakarv Here is an update: I've got it working by using my database name as the instance name. For example, if my database name is rdw and the table name is people, then my whitelistedTables string for this table will be rdw.rdw.people. I hope this helps everyone here.
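
As a rough illustration of the format described in this comment, a whitelistedTables entry has three dot-separated parts. The helper below is a hypothetical sketch (the function name is not part of the connector). It builds the entry under the assumption reported here, namely that the database name also serves as the instance name when no other instance name is known.

```python
def whitelist_entry(database, table, instance=None):
    """Build one whitelistedTables entry in the form instance.database.table.

    Assumption taken from this thread: when no instance name is given,
    the database name is reused as the instance name.
    """
    if instance is None:
        instance = database
    return ".".join([instance, database, table])


# Example from the comment: database "rdw", table "people".
entry = whitelist_entry("rdw", "people")
print("whitelistedTables=" + entry)
```

Whether this workaround applies to every MySQL setup is not confirmed in the thread; it is only what worked for this commenter.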

karunakarv commented 3 years ago

@chetansharmagithub After changing the whitelisted table name, I am getting a new error: "ERROR common.DataCatalogSchemaUtils: Entry group name: "

chetansharmagithub commented 3 years ago

@karunakarv I was facing a similar error earlier. This ERROR is probably caused by using the wrong name for the topic. The documentation currently on GitHub is not completely correct. The topic name should be the value you've specified for the 'gcpPubsubTopicPrefix' key in the dataflow_cdc.properties file. For example, if

gcpPubsubTopicPrefix=exportdemo

then the Pub/Sub topic name should be 'exportdemo' in the GCP console.

I hope this solves your problem too.
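
To check which topic name the properties file implies, one option is to read the file and print the prefix value. This is a minimal sketch that assumes the file uses plain key=value lines; the parser and the sample content are illustrative and are not taken from the connector.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no "="
            props[key.strip()] = value.strip()
    return props


# Hypothetical file content, for illustration only.
SAMPLE = """\
# sample connector properties
singleTopicMode=true
gcpPubsubTopicPrefix=exportdemo
"""

props = parse_properties(SAMPLE)
print("Topic expected in Pub/Sub:", props["gcpPubsubTopicPrefix"])
```

The printed name can then be compared with the topic that actually exists in the GCP console.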

mariana-vc commented 3 years ago

I'm having the same error as @karunakarv:
ERROR common.DataCatalogSchemaUtils: Entry group name:
Topic name at GCP => topic_name
gcpPubsubTopicPrefix=topic_name

chetansharmagithub commented 3 years ago

@mariana-vc There is a catch with the topic name on GCP. Let me explain in detail:

When running cdc-embedded-connector with singleTopicMode=true, changes in all the whitelisted tables are pushed to a single topic, and the topic name should be the same as the value of gcpPubsubTopicPrefix specified in dataflow_cdc.properties.

When running in multi-topic mode, i.e. singleTopicMode=false (or with the property omitted, as the default value is false), multiple topics should be created on GCP, and changes in each whitelisted table are pushed to the corresponding Pub/Sub topic. Each topic name should be a combination of the value of gcpPubsubTopicPrefix specified in dataflow_cdc.properties and the fully qualified table name. For example, if gcpPubsubTopicPrefix=topic_name and the fully qualified table name is instance_name.db_name.people, then the topic on GCP should be named topic_name_instance_name.db_name.people. Separate topics should be created in the same way for all the tables.

I hope this helps.
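
The naming rule described in this comment can be sketched as a small function that lists the topics to create ahead of time. The function name is hypothetical, and the underscore separator is taken from the example topic name in the comment; it has not been checked against the connector source.

```python
def expected_topics(prefix, tables, single_topic_mode=False):
    """List the Pub/Sub topic names to create, per the rule in this thread.

    Assumption: in multi-topic mode the name is prefix + "_" + table,
    mirroring the commenter's example rather than the connector source.
    """
    if single_topic_mode:
        # All whitelisted tables share one topic named after the prefix.
        return [prefix]
    # One topic per fully qualified table name.
    return [prefix + "_" + table for table in tables]


tables = ["instance_name.db_name.people"]
print(expected_topics("topic_name", tables))
print(expected_topics("debeziumTest", tables, single_topic_mode=True))
```

If the connector joins the prefix and table name differently, only the separator in the list comprehension would need to change.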

hugosoftdev commented 3 years ago

Any update on this? @chetansharmagithub did you make it work? If so, can you share your config pls? (with fake infos, no worries lol)

chetansharmagithub commented 3 years ago

@hugosoftdev It was working with the details I shared in the comments above. Please follow them. If you still face any issues after that, which I don't think will be the case, feel free to share the problem/error you're stuck at.

github-actions[bot] commented 3 months ago

This issue has been marked as stale due to 180 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the issue at any time. Thank you for your contributions.

github-actions[bot] commented 3 months ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.