
cdc: add option to strip out metadata for pubsub #82307

Open amruss opened 2 years ago

amruss commented 2 years ago

Today, with our GC Pub/Sub integration, you need to use ETL tools like Cloud Dataflow to get the JSON changefeed messages into the right format for BigQuery.

There is a large contingent of users who want to use GC Pub/Sub + BigQuery, so it makes sense for us to offer an option that formats the data in a way that makes it as easy as possible to send from GC Pub/Sub to BigQuery (i.e., no metadata, only the columns that correspond to the columns in the table).

envelope=row does not work with pubsub
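For illustration, a minimal sketch of the requested behavior using the existing envelope='row' option, which (per the testing later in this thread) works on Kafka but is rejected by the Pub/Sub sink; the Kafka address is a placeholder:

```sql
-- Emit only the row's columns, with no changefeed metadata wrapper.
-- Works on Kafka today; as of this report the Pub/Sub sink rejects it:
--   ERROR: this sink is incompatible with envelope=row
CREATE CHANGEFEED FOR TABLE accounts
  INTO 'kafka://broker:9092'  -- placeholder address
  WITH envelope = 'row';
```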

Jira issue: CRDB-16281

Epic CRDB-37095

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/cdc

carlsongould commented 2 years ago

I am trying to get CRDB (CDC) -> Pub/Sub -> Dataflow -> BigQuery working and have encountered an issue. BigQuery does not support maps or dictionaries in JSON, due to the potential lack of schema information in a pure JSON dictionary. It essentially has very basic JSON-reading capabilities, which makes sense as it emulates an RDBMS.

Steps to reproduce: currently CRDB CDC outputs to Google Pub/Sub, and a topic is created (I did not explicitly indicate the topic name) using the command:

```sql
CREATE CHANGEFEED FOR TABLE pubsubdb.accounts
  INTO 'gcpubsub://cockroach-jeffcarlson?region=us-east1&AUTH=specified&CREDENTIALS={credentials}';
```

To validate that the Pub/Sub data was being received, I used Google Cloud Storage to subscribe to the topic. This confirmed the formatting of the data being received and that it was indeed working as expected.

CRDB CDC sends data in the following format to Pub/Sub:

```json
{"key": [94], "value": {"after": {"balance": 5.21, "id": 94}}, "topic": "accounts"}
```

I also tried outputting to Kafka. CRDB output to Kafka:

```json
{"after": {"id": 1, "name": "Petee"}}
{"after": {"id": 2, "name": "Carl"}}
{"after": {"id": 9, "name": "Keith"}}
{"after": {"id": 10, "name": "harsh"}}
```

When loading data into BigQuery, it wanted newline-delimited JSON in the following shape, which loaded into the table correctly for multiple fields:

```json
{"id": 13, "name": "Plan B"}
{"id": 14, "name": "Lanky"}
```

Dataflow is the primary tool for moving/streaming data into BigQuery and elsewhere; it creates pipelines from Pub/Sub. In testing, the pipeline worked as expected; the big issue ultimately is the format of the JSON being received by BigQuery.

HonoreDB commented 2 years ago

Does WITH envelope='row' help with this?

carlsongould commented 2 years ago

Great idea, so I tested it for both the Kafka and the Pub/Sub connectors.

For Kafka, that produces the following output for my table named t, which contains three columns (int, varchar, boolean):

```json
{"rowid": 767553866843324417, "test1": 1, "test2": "hello", "test3": true}
{"rowid": 767560237475299329, "test1": 1, "test2": "hello", "test3": true}
```

It also outputs the row ID, which may be problematic but is easily addressed by adding rowid to the BigQuery table. However, when I apply the WITH option to the Pub/Sub connector, I receive an error. I used the command:

```sql
CREATE CHANGEFEED FOR TABLE mytest.t
  INTO 'gcpubsub://cockroach-jeffcarlson?region=us-east1&AUTH=specified&CREDENTIALS={credentials}'
  WITH envelope = 'row';
```

The output was:

```
ERROR: this sink is incompatible with envelope=row
```

I also tested with the optional 'topic_name' parameter in case it had become required, but received the same error. It seems the option is not available for the Pub/Sub sink yet.

miretskiy commented 1 year ago

@amruss do expressions help here, i.e., emit whatever you want to emit? If so, can you close this issue?

miretskiy commented 1 year ago

Let's figure out whether a CDC query helps here; better yet, our answer might be to use "parquet".
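For reference, a minimal sketch of the CDC-query approach, assuming the CREATE CHANGEFEED ... AS SELECT syntax available in later releases; whether its output shape satisfies BigQuery's newline-delimited JSON requirement would still need to be verified:

```sql
-- Hypothetical sketch: project only the table's own columns so the
-- message body carries no changefeed metadata wrapper.
-- The sink URI and credentials placeholder mirror the earlier repro.
CREATE CHANGEFEED
  INTO 'gcpubsub://cockroach-jeffcarlson?region=us-east1&AUTH=specified&CREDENTIALS={credentials}'
  AS SELECT id, balance FROM pubsubdb.accounts;
```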