citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.43k stars 662 forks source link

NATS CDC and citus #6665

Open gedw99 opened 1 year ago

gedw99 commented 1 year ago

same as here but with NATS.

https://github.com/citusdata/citus/issues/44

NATS is like kafka but is golang and uses less resources than kafka. I use NATS for CDC now and would be cool if citus integrated with NATS

pinodeca commented 1 year ago

Leaving aside the challenge of getting a change stream for a distributed table - is any Citus-specific integration needed compared to PG-NATS integration? I guess it's mostly about automatically adding new Citus nodes to the data stream?

rajeshkt78 commented 1 year ago

I haven't tried NATS yet(but assuming it is similar to Kafka), but from Citus CDC perspective, the changes needed are: 1) Avoiding duplication of events publication during internal operations like create_distributed_table/shard splits/moves and rebalacing etc. 2) Doing transalation of "shard" tables to "distributed" tables and handling schema changes as needed.

These changes are being implemented in this PR: https://github.com/citusdata/citus/pull/6623

gedw99 commented 1 year ago

@rajeshkt78 This PR looks excellent !!

yep NATS is very like kafka.

C Client for it. https://github.com/nats-io/nats.c

NATS Server is able to be clustered and then put a Gateway in front.

It's all baked into the nats server.

Perf is excellent. Even though it's written in golang its been designed to avoid garbage collection ( as much as possible ) by using the Stack and not the Heap.

It can run on OS and Chip ISA architecture. WASM, IOT, desktops, mobile etc etc

Its has a control plane and a data plane. A Good way to think of it is a Top of Rack Router that is virtualized.

Any serialisation can be used. It has JSON and Protocol buffers built in. I tend to use Protocol buffers. Not sure what Citus uses ???


It can also handle On Premise data via the NATS Leaf functionality. I use this quite a bit.

The NATS Server is the Cloud can reach the NATS Server running in Leaf mode anywhere. No network changes needed.

https://docs.nats.io/running-a-nats-service/configuration/leafnodes

https://docs.nats.io/running-a-nats-service/configuration/leafnodes/jetstream_leafnodes


Security is also baked in and grantable to others. It's very complete and a fair bit to explain here.

https://docs.nats.io/nats-concepts/security


Blue Green rolling upgrades also work well. It's called Lame Duck mode.

So you have no down time as you do a rolling upgrade of NATS Server.

https://docs.nats.io/running-a-nats-service/nats_admin/lame_duck_mode.


Tooling Makefile to operate everything listed above to get going fast..


# NATS-SERVER
# https://github.com/nats-io/nats-server
# https://github.com/nats-io/nats-server/releases/tag/v2.9.14
NATS_SERVER_BIN_NAME=nats-server
NATS_SERVER_BIN_VERSION=v2.9.14
NATS_SERVER_BIN_WHICH=$(shell which $(NATS_SERVER_BIN_NAME))

# NSC
# https://github.com/nats-io/nsc
# https://github.com/nats-io/nsc/releases/tag/v2.7.6
NATS_NSC_BIN_NAME=nsc
NATS_NSC_BIN_VERSION=2.7.6
NATS_NSC_BIN_WHICH=$(shell which $(NATS_NSC_BIN_NAME))

# NK
# https://github.com/nats-io/nkeys
# https://github.com/nats-io/nkeys/releases/tag/v0.3.0
NATS_NK_BIN_NAME=nk
NATS_NK_BIN_VERSION=v0.3.0
NATS_NK_BIN_WHICH=$(shell which $(NATS_NK_BIN_NAME))

# NATS_TOP
# https://github.com/nats-io/nats-top
# https://github.com/nats-io/nats-top/releases/tag/v0.5.3
NATS_TOP_BIN_NAME=nats-top
NATS_TOP_BIN_VERSION=v0.5.3
NATS_TOP_BIN_WHICH=$(shell which $(NATS_TOP_BIN_NAME))

# NATS
# https://github.com/nats-io/natscli
# https://github.com/nats-io/natscli/releases/tag/v0.0.35
NATS_CLI_BIN_NAME=nats
NATS_CLI_BIN_VERSION=v0.0.35
NATS_CLI_BIN_WHICH=$(shell which $(NATS_CLI_BIN_NAME))
rajeshkt78 commented 1 year ago

@gedw99 The CDC feature (Preview) for distributed tables is availabe in Citus 11.3 release. Please find the release notes on CDC here: https://www.citusdata.com/updates/v11-3/#cdc_support

We have tried Apache Kafka integration using debezium connectors and we are able to steam events from distributed table to kafka clients correctly.

Please try the CDC feature for your use case and give us any feedback.

Thanks.

rajeshkt78 commented 1 year ago

@gedw99 Had a chance to try the CDC feature for 11.3 release? please let us know if you have any feedback on this feature. Thanks.

gedw99 commented 1 year ago

Hey @rajeshkt78

I ended up using another approach that can do cdc on most databases as well as google sheets etc …

rajeshkt78 commented 1 year ago

Just curious.. what is the CDC method that works with most databases..?

gedw99 commented 1 year ago

https://github.com/ConduitIO/conduit-connector-protocol

rajeshkt78 commented 1 year ago

OK,so this is the postgres connector for that frameowrk right? https://github.com/ConduitIO/conduit-connector-postgres

Just FYI: Looking at the code, I think it uses the same logical replication slots to get data from PostgreSQL publication.. So to read data from Citus distributed table, it would still need the same method of creating a subscription for each worker node and get the changes though logical replication..

skysada commented 1 year ago

@rajeshkt78

We have tried Apache Kafka integration using debezium connectors and we are able to steam events from distributed table to kafka clients correctly.

I'm using V12.0 and attempting a debezium citus to pubsub sync.

it would still need the same method of creating a subscription for each worker node and get the changes though logical replication..

Does Debezium setup the subscriptions automatically, or what method are you referring to here?