cloudcompute opened 1 year ago
Hello,
Thanks for your interest.
a. I'm not a CDC / Debezium expert, but here are some elements:
we want to be able to publish events to Kafka even if the application crashes right after the Postgres write. From my understanding, CDC works with triggers that fire only once per event, so in a scenario like the following the event may never be published:
Since our use case is pretty simple, we didn't feel the need to bring in additional dependencies such as Debezium.
Regarding the "how", here is what Thoth does under the hood (simplified version):
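In broad strokes, something like the following minimal sketch (all names are hypothetical and the database/broker are stood in for by in-memory lists; this is my illustration, not Thoth's actual code):

```java
import java.util.*;

// Sketch of the write path described in this thread: the event is persisted
// in the database as part of the business transaction, so a crash right
// after the Postgres write cannot lose it; a separate publisher loop then
// pushes pending rows to Kafka and marks them as published.
public class OutboxSketch {
    static final class JournalRow {           // simulated "journal" table row
        final String payload;
        boolean published;
        JournalRow(String payload) { this.payload = payload; }
    }

    static final List<JournalRow> journal = new ArrayList<>(); // simulated DB table
    static final List<String> kafka = new ArrayList<>();       // simulated broker

    // 1. Business write: state change + event row in the same transaction.
    static void handleCommand(String event) {
        journal.add(new JournalRow(event));
    }

    // 2. Publisher: send unpublished rows in insertion order; a row is
    // flagged only after the (simulated) broker accepted it.
    static void publishPending() {
        for (JournalRow row : journal) {
            if (!row.published) {
                kafka.add(row.payload);  // in Thoth this is retried until it succeeds
                row.published = true;
            }
        }
    }

    public static void main(String[] args) {
        handleCommand("AccountOpened");
        handleCommand("MoneyDeposited");
        publishPending();
        System.out.println(kafka); // prints [AccountOpened, MoneyDeposited]
    }
}
```

The key point is that step 1 and the state change commit together, so the event can never be lost between the database write and the Kafka publication.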
b. @larousso is working on an Akka-free version using Reactor instead.
c. The current Kafka publication speed is enough for our needs, therefore we chose to keep Kafka publication simple. I think it may be possible to add an opt-in option for Avro / Protobuf support, but it's definitely not on our roadmap. Perhaps it could be a good contribution; what do you think @larousso?
d.
e. The main difference is the driver used to communicate with Postgres: the non-blocking jOOQ/Akka implementation is based on the reactive Vert.x driver for Postgres, while the standard jOOQ/Kafka implementation uses a non-reactive driver.
f. Could you elaborate on this point?
Hi,
a. With the current implementation, events are published to Kafka in near real time; I think that with Debezium there would be more latency. The current algorithm enqueues messages in memory in order to send them to Kafka. If Kafka is down, the process crashes and, on restart, reads the events back from the database to recover the unpublished messages, while the hot messages are kept in memory. The idea is to try to keep the order of the events.
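If I read that right, the restart behaviour could be sketched like this (hypothetical names; the journal is simulated with an in-memory list rather than a real database):

```java
import java.util.*;

// Sketch of the recovery step described above: after a crash, the publisher
// re-reads the journal and republishes everything that was not yet marked
// as published, preserving the original sequence order.
public class RecoverySketch {
    record StoredEvent(long sequence, String payload, boolean published) {}

    // Events found in the database at restart: #1 made it to Kafka, #2 and #3 did not.
    static final List<StoredEvent> journalAtRestart = List.of(
        new StoredEvent(1, "AccountOpened", true),
        new StoredEvent(2, "MoneyDeposited", false),
        new StoredEvent(3, "MoneyWithdrawn", false));

    static List<String> republish() {
        return journalAtRestart.stream()
            .filter(e -> !e.published())                              // only unpublished events
            .sorted(Comparator.comparingLong(StoredEvent::sequence))  // keep the original order
            .map(StoredEvent::payload)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(republish()); // prints [MoneyDeposited, MoneyWithdrawn]
    }
}
```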
b. The first version of Thoth used Akka as the Reactive Streams implementation, but with the license change, the Akka implementation was kept and a Reactor implementation was added. I think we won't keep supporting the Akka implementation because we won't use it anymore. At the moment the only remaining dependency is for tests, but I will remove it as soon as possible.
c. There are tools for JSON serialization because that is what we're using at the moment, but you could write your own Avro serializer if you need it; there is no blocker for that.
d.
e. I don't know if @ptitFicus uses JDBC in production, but I think the non-blocking one is the more stable solution.
Thanks to both @larousso and @ptitFicus for the detailed responses.
Implementing CDC using the Outbox Pattern
It preserves the order of events. There is some latency, but it is very low. An asynchronous process runs in the background that copies the event records from the Outbox table to Kafka. If it is restarted, it reads the records again, which may cause duplicates in Kafka. This gives at-least-once processing semantics: if there are duplicates, the consuming service has to make sure to ignore them (idempotency).
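For the consumer side, a minimal sketch of such idempotent handling (an in-memory dedup set for illustration only; a real service would persist the processed ids so they survive restarts):

```java
import java.util.*;

// Sketch of consumer-side idempotency under at-least-once delivery: each
// event carries a unique id, and the consumer remembers the ids it has
// already processed so redelivered duplicates are silently ignored.
public class IdempotentConsumer {
    static final Set<String> processedIds = new HashSet<>();
    static final List<String> applied = new ArrayList<>();

    static void onEvent(String eventId, String payload) {
        if (!processedIds.add(eventId)) {
            return; // duplicate delivery: already handled, skip it
        }
        applied.add(payload); // the real handler work would go here
    }

    public static void main(String[] args) {
        onEvent("evt-1", "MoneyDeposited");
        onEvent("evt-1", "MoneyDeposited"); // redelivered after a restart
        onEvent("evt-2", "MoneyWithdrawn");
        System.out.println(applied); // prints [MoneyDeposited, MoneyWithdrawn]
    }
}
```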
hot messages are kept in memory
I don't know how you have implemented it, but what if the process that is keeping track of these messages itself dies?
f. Is there no need for a Kafka proxy?
It was left by mistake; I merged this question into point c. I wanted to ask whether it is possible (or makes sense while using ES) to put a Kafka proxy in front of Kafka so that all events written to or read from it by microservices go through this proxy. There are several benefits: for instance, if there are hundreds of microservices, they don't need to know the Kafka cluster details such as its location; they just need to know about this proxy. It has other benefits too, which are stated here: [gRPC/REST proxy for Kafka]
c. There are tools for JSON serialization because that is what we're using at the moment, but you could write your own Avro serializer if you need it; there is no blocker for that.
Correct
With regards to both of you
Hi @larousso and @ptitFicus
This project looks valuable. I have a few queries relating to it.
a. Why have you decided not to use CDC (Debezium), and how are you able to achieve this without CDC? Pretty interesting. The documentation says: "It keeps trying writing to Kafka until succeeded"; won't this impact performance?
b. What is the role of Akka in this project? Can we do away with it, since it is BSL now?
c. While writing to Kafka, the Avro/Protobuf formats, which are really fast, are not used. Any reason for that? Is it possible to integrate Thoth with this gRPC/REST proxy for Kafka?
d. Does it support these: optimistic concurrency (multiple users writing events to the same table at the same time), event replay (right from the beginning, for audits etc.), snapshots, topic compaction (deleting/overwriting a message record for a given key), and event versioning?
e. What is the difference between the non-blocking jOOQ and the standard jOOQ/Kafka implementations? For which use cases should we use which implementation?
f. Is there no need for a Kafka proxy?
Thanks