apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Feature Request]: Is it possible for the Apache Beam MongoDB SDK to support reading MongoDB CDC (change data capture) streams? #23551

Open Rstar1998 opened 1 year ago

Rstar1998 commented 1 year ago

What would you like to happen?

Hi folks, there are many use cases where people want real-time migration of MongoDB collections to BigQuery (data warehouse) or GCS (data lake). It would therefore be great if the Apache Beam MongoDB SDK supported reading MongoDB CDC (change data capture) streams in real time, so that the captured events could be written to BigQuery or GCS for analytical purposes. Currently, achieving this requires a fairly complex architecture on GCP involving Debezium, Kafka, etc. If this feature were implemented properly, a single pipeline could do all of it.
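To make the request concrete, here is a minimal sketch of the per-event step such a pipeline would need: turning one MongoDB change event into a flat row that could be written to BigQuery or GCS. It uses plain Python dicts only. The function name `change_event_to_row` and the output column names are hypothetical; the input keys (`operationType`, `ns`, `documentKey`, `fullDocument`) follow MongoDB's documented change event shape, which is what a change stream (for example pymongo's `Collection.watch()`) would be expected to yield.

```python
import json
from typing import Any, Dict, Optional


def change_event_to_row(event: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Flatten one MongoDB change event into a flat, JSON-friendly row.

    Hypothetical helper: the output column names are illustrative only.
    """
    ns = event.get("ns", {})
    # fullDocument is absent for delete events (and for updates unless
    # full-document lookup is enabled on the change stream).
    full_document = event.get("fullDocument")
    return {
        "operation": event["operationType"],
        "database": ns.get("db"),
        "collection": ns.get("coll"),
        # default=str keeps non-JSON types (e.g. ObjectId) serializable.
        "document_key": json.dumps(event.get("documentKey", {}), default=str),
        "document": None
        if full_document is None
        else json.dumps(full_document, default=str),
    }


# Example with a hand-written insert event (not real captured data).
sample_event = {
    "operationType": "insert",
    "ns": {"db": "shop", "coll": "orders"},
    "documentKey": {"_id": 1},
    "fullDocument": {"_id": 1, "total": 9.5},
}
row = change_event_to_row(sample_event)
```

In a Beam pipeline, a function like this could be applied with `beam.Map` between a streaming source and a sink. The streaming source itself is the piece this issue asks for.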

Issue Priority

Priority: 3

Issue Component

Component: io-py-mongodb

xinbinhuang commented 1 year ago

Not an expert on Beam's support for this, but have you tried DebeziumIO? https://beam.apache.org/releases/javadoc/2.30.0/org/apache/beam/io/debezium/DebeziumIO.html

Rstar1998 commented 1 year ago

@xinbinhuang You are right. But most customers prefer a managed service (with respect to compute and scaling) over hosting Debezium on a server. If we are able to make Mongo streaming possible using Dataflow, it will help in a lot of data engineering use cases.

xinbinhuang commented 1 year ago

I'm not sure I follow your question. I believe DebeziumIO is self-contained and encapsulated within Beam, so you don't need to set up a separate Kafka and Debezium cluster. Though experimental, it should also work with the MongoDbConnector.

> If we are able to make Mongo streaming possible using Dataflow, it will help in a lot of data engineering use cases

Is your use case about Beam or about Dataflow? Beam should just work on Dataflow.

Rstar1998 commented 1 year ago

@xinbinhuang DebeziumIO should work in my case. Unfortunately, DebeziumIO in the Beam library itself doesn't have Mongo support.

https://beam.apache.org/releases/javadoc/2.30.0/org/apache/beam/io/debezium/DebeziumIO.html

https://github.com/apache/beam/blob/master/sdks/java/io/debezium/src/main/java/org/apache/beam/io/debezium/Connectors.java

Maybe some tweaks would be enough to do the job, because Debezium itself does have Mongo connector support.

@benWize Is it possible to include the Mongo connector in the Debezium section of Apache Beam?

https://github.com/apache/beam/blob/4ffeae4d2b800f2df36d2ea2eab549f2204d5691/sdks/java/io/debezium/src/main/java/org/apache/beam/io/debezium/Connectors.java