MAIF / thoth

Event sourcing in java with vavr, akka stream and vertx reactive PG driver
https://MAIF.github.io/thoth/
Apache License 2.0

Few queries #54

Open cloudcompute opened 1 year ago

cloudcompute commented 1 year ago

Hi @larousso and @ptitFicus

This project looks valuable. I have a few queries relating to it.

a. Why have you decided not to use CDC (Debezium), and how are you able to achieve this without CDC? Pretty interesting. The documentation says: "It keeps trying writing to Kafka until succeeded". Won't that impact performance?

b. What is the role of akka in this project? Can we do away with it, since it is under the BSL now?

c. While writing to Kafka, the Avro/Protobuf formats, which are really fast, are not used. Is there a reason for that? Is it possible to integrate Thoth with a gRPC/REST proxy for Kafka?

d. Does it support the following: optimistic concurrency (multiple users writing events to the same table at the same time), event replay (from the very beginning, for audits etc.), snapshots, topic compaction (deleting/overwriting a message record for a given key), and event versioning?

e. What is the difference between the Non blocking JOOQ and the Standard JOOQ/Kafka implementations? For which use cases should we use which implementation?

f. Is there no need for a Kafka proxy?

Thanks

ptitFicus commented 1 year ago

Hello,

Thanks for your interest.

a. I'm not a CDC / Debezium expert, however here are some elements:

b. @larousso is working on an akka-free version using Reactor instead.

c. The current Kafka publication speed is enough for our needs, therefore we chose to keep Kafka publication simple. I think it may be possible to add opt-in Avro/Protobuf support, but it's definitely not on our roadmap. Perhaps it could be a good contribution, what do you think @larousso ?

d.

e. The main difference is the driver used to communicate with Postgres: the non-blocking JOOQ/Akka implementation is based on the reactive Vert.x driver for Postgres, while the standard JOOQ/Kafka implementation uses a non-reactive (blocking) driver.

f. Could you elaborate on this point?

larousso commented 1 year ago

Hi,

a. With the current implementation, events are published to Kafka in near real time. I think that with Debezium there would be more latency. The current algorithm enqueues messages in memory in order to send them to Kafka. If Kafka is down, the process crashes and restarts by reading the events from the database, so as to recover the unpublished messages, while the hot messages are kept in memory. The idea is to try to preserve the order of the events.
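To make that recovery step concrete, here is a minimal sketch (hypothetical code, not Thoth's actual implementation): events are stored in the journal with a published flag, and on restart the publisher re-reads the unpublished events in sequence order before resuming, which preserves ordering and gives at-least-once delivery.

```java
import java.util.ArrayList;
import java.util.List;

public class RestartingPublisher {

    // Minimal stand-in for a row in the event journal table (assumed schema).
    public record Event(long sequenceNum, String payload, boolean published) {}

    // What a restarted publisher would do first: collect the events that
    // never made it to Kafka, ordered by sequence number, and republish them
    // before handling new in-memory "hot" messages.
    public static List<Event> unpublishedInOrder(List<Event> journal) {
        List<Event> pending = new ArrayList<>();
        for (Event e : journal) {
            if (!e.published()) {
                pending.add(e);
            }
        }
        pending.sort((a, b) -> Long.compare(a.sequenceNum(), b.sequenceNum()));
        return pending;
    }
}
```

Because already-published events may be republished around a crash, this is at-least-once: consumers must tolerate duplicates.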

b. The first version of thoth used akka as its reactive streams implementation, but with the licence change the akka implementation was kept and a Reactor implementation was added. I think we won't keep supporting the akka implementation, because we won't use it anymore. At the moment the only remaining dependency is for tests, but I will remove it as soon as possible.

c. There are tools for JSON serialization because that is what we're using at the moment, but you could write your own Avro serializer if you need it; there is no blocker for that.

d.

  1. for optimistic concurrency, a sequence number can be used to detect updates from a stale version.
  2. for event replay, there is a stream method in the API that can be used to read all the events from the database.
  3. topic compaction: at the moment the shard key is the entity id, so if you enable compaction in Kafka I think it will keep the last event for a given entity.
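The sequence-number check in point 1 can be sketched as a compare-and-set (hypothetical illustration, not Thoth's API): a writer submits the sequence number it last read for an entity, and the append is rejected if another writer has advanced the sequence in the meantime.

```java
import java.util.HashMap;
import java.util.Map;

// Optimistic concurrency via a per-entity sequence number: the append only
// succeeds if the caller's expected sequence matches the stored one.
public class SequenceGuard {
    private final Map<String, Long> lastSeq = new HashMap<>();

    // Returns true if the append succeeds, false if expectedSeq is stale
    // (i.e. a concurrent writer won; the caller must re-read and retry).
    public synchronized boolean tryAppend(String entityId, long expectedSeq) {
        long current = lastSeq.getOrDefault(entityId, 0L);
        if (current != expectedSeq) {
            return false;
        }
        lastSeq.put(entityId, current + 1);
        return true;
    }
}
```

In a real event store the same check is typically done in the database, e.g. with a unique constraint on (entity_id, sequence_num), so a stale writer gets a constraint violation instead of silently overwriting.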

e. I don't know if @ptitFicus uses JDBC in production, but I think the non-blocking one is the more stable solution.

cloudcompute commented 1 year ago

Thanks to both @larousso and @ptitFicus for the detailed responses.

Implementing CDC using Outbox Pattern

It preserves the order of events. There is latency, but it is very low. An asynchronous process runs in the background and copies the event records from the outbox table to Kafka. If it is restarted, it reads again, which may cause duplicates in Kafka. This gives at-least-once processing semantics. If there are duplicates, the consuming service has to make sure to ignore them (idempotency).
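The idempotent-consumer side of this pattern can be sketched as follows (a minimal illustration, with hypothetical names; in production the set of seen event ids would live in durable storage, not in memory):

```java
import java.util.HashSet;
import java.util.Set;

// With at-least-once delivery the consumer must be idempotent: it remembers
// the ids of events it has already applied and skips redeliveries.
public class IdempotentConsumer {
    private final Set<String> seen = new HashSet<>();
    private int applied = 0;

    // Applies the event only if its id has not been processed before;
    // returns false for a duplicate, which is simply ignored.
    public boolean handle(String eventId) {
        if (!seen.add(eventId)) {
            return false;
        }
        applied++;
        return true;
    }

    public int appliedCount() {
        return applied;
    }
}
```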

> hot messages are kept in memory

I don't know how you have implemented it, but what if the process that keeps track of these messages itself dies?

> f. Is there no need for a Kafka proxy?

It was left in by mistake; I merged this question into point c. I wanted to ask whether it is possible (or makes sense, while using ES) to put a Kafka proxy in front of Kafka so that all events written to or read from it by microservices go through this proxy. There are several benefits: for instance, if there are hundreds of microservices, they don't need to know the Kafka cluster details such as its location; they just need to know about the proxy. It has other benefits too, which are stated here: [gRPC/REST proxy for Kafka]

> c. There are tools for json serialization because it is what we're using at the moment but you could write your own avro serializer if you need it, there is no blocker for that.

Correct

With regards to both of you