ConduitIO / conduit

Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
https://conduit.io
Apache License 2.0

Cassandra source connector #989

Open maha-hajja opened 1 year ago

maha-hajja commented 1 year ago

Feature description

Add a source connector to https://github.com/conduitio-labs/conduit-connector-cassandra

alarbada commented 3 months ago

Here's a brief summary of the approaches we could take to build this connector.

Using CDC mode

The most reliable way to stream database changes is to use Cassandra's CDC (change data capture) mode: https://cassandra.apache.org/doc/stable/cassandra/operating/cdc.html

CDC is enabled per table by setting cdc=true, as follows:

CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;

With that set, Cassandra writes the table's data changes into files inside the configured CDC directory (the cdc_raw_directory setting in cassandra.yaml). Getting changes this way means we need to write a custom program that watches the CDC directory for new files and parses them. Here are some implementations:
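To make the "watch the CDC directory" step concrete, here is a minimal Go sketch of the file discovery part only. It assumes the files in the CDC directory are commit log segments named like CommitLog-<version>-<id>.log, and the default path in main is a hypothetical example, not something taken from the Cassandra docs. Parsing the segment contents, which is the hard part, is not covered.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// newSegments returns the commit log segment names in names that are not yet
// in seen, sorted by name, and records them in seen.
// Assumption: segments are named "CommitLog-<version>-<id>.log"; anything
// else in the directory (index files, temp files) is skipped.
func newSegments(names []string, seen map[string]bool) []string {
	var fresh []string
	for _, n := range names {
		if !strings.HasPrefix(n, "CommitLog-") || !strings.HasSuffix(n, ".log") {
			continue
		}
		if seen[n] {
			continue
		}
		seen[n] = true
		fresh = append(fresh, n)
	}
	sort.Strings(fresh)
	return fresh
}

// listDir returns the names of the regular files directly inside dir.
func listDir(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if e.Type().IsRegular() {
			names = append(names, e.Name())
		}
	}
	return names, nil
}

func main() {
	// Hypothetical location; use whatever cdc_raw_directory points to.
	dir := "/var/lib/cassandra/cdc_raw"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	seen := map[string]bool{}
	names, err := listDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read cdc dir:", err)
		return
	}
	for _, s := range newSegments(names, seen) {
		fmt.Println("new segment:", s)
	}
}
```

A real agent would call listDir and newSegments in a loop (or react to filesystem notifications), hand each new segment to a commit log parser, and delete the file once it has been processed.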

Debezium captures Cassandra events via a single JVM process that runs on each Cassandra node and publishes them to Kafka.

same as debezium. Consists of 2 components:

There are no database triggers involved, and this is the approach the Cassandra docs recommend. On the other hand, we end up with a distributed source connector, because a process has to run next to every Cassandra node, and that adds a considerable amount of complexity.

Using polling

Another, vastly simpler approach to capturing Cassandra events would be to query the given tables every x amount of time for new changes, filtering results by a last_updated column.
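A rough Go sketch of that polling loop, to show the position tracking involved. Everything here is hypothetical: the Row type, the FetchFunc signature, and the in-memory table that stands in for Cassandra. In a real connector the fetch function would run a CQL query through a driver such as gocql, presumably something like `SELECT ... FROM foo WHERE last_updated > ?`, which on a non-key column would need an index or ALLOW FILTERING.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a hypothetical, simplified view of one table row.
type Row struct {
	Key           int
	Value         string
	LastUpdatedMs int64 // last_updated as Unix milliseconds
}

// FetchFunc returns every row whose last_updated is strictly after sinceMs.
// A real implementation would issue a CQL query here.
type FetchFunc func(sinceMs int64) ([]Row, error)

// Poller remembers the highest last_updated value it has emitted so far.
type Poller struct {
	fetch    FetchFunc
	position int64
}

// Poll fetches rows changed since the stored position, returns them ordered
// by last_updated, and advances the position to the newest row returned.
func (p *Poller) Poll() ([]Row, error) {
	rows, err := p.fetch(p.position)
	if err != nil {
		return nil, err
	}
	sort.Slice(rows, func(i, j int) bool {
		return rows[i].LastUpdatedMs < rows[j].LastUpdatedMs
	})
	if len(rows) > 0 {
		p.position = rows[len(rows)-1].LastUpdatedMs
	}
	return rows, nil
}

// memoryTable fakes a Cassandra table with a slice so the sketch runs
// without a database.
func memoryTable(rows *[]Row) FetchFunc {
	return func(sinceMs int64) ([]Row, error) {
		var out []Row
		for _, r := range *rows {
			if r.LastUpdatedMs > sinceMs {
				out = append(out, r)
			}
		}
		return out, nil
	}
}

func main() {
	table := []Row{
		{Key: 1, Value: "a", LastUpdatedMs: 100},
		{Key: 2, Value: "b", LastUpdatedMs: 200},
	}
	p := &Poller{fetch: memoryTable(&table)}

	first, _ := p.Poll()
	fmt.Println("first poll:", len(first), "rows")

	table = append(table, Row{Key: 3, Value: "c", LastUpdatedMs: 300})
	second, _ := p.Poll()
	fmt.Println("second poll:", len(second), "rows")
}
```

In a connector, Poll would be called on a timer and the position persisted between restarts. Two limitations follow from the design: deleted rows never show up in the query results, and a row written later with the same last_updated value as the stored position would be skipped by the strict comparison.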