fraktalio / fstore-sql

PostgreSQL as event store - event sourcing & event streaming

Can we bypass Kafka? #2

Closed cloudcompute closed 6 months ago

cloudcompute commented 6 months ago

Hi @idugalic

I went through the SQL code. PostgreSQL is used for both: storing events and then streaming them to consumer applications.

Do you think PostgreSQL (used this way) could act as a replacement for Kafka, which is a pure event-streaming platform?

Thanks

idugalic commented 6 months ago

Hi @cloudcompute ,

The short answer is yes, you can use Postgres for event sourcing and event streaming. It can replace Kafka for the event-streaming part, at least up to a certain scale. We (@fraktalio) are using Postgres as the only platform to support event-driven design. It is cost-effective and simple to manage. We are running this SQL model on https://supabase.com/

I must say that relational databases are not built for streaming, and Kafka will do a better job at event streaming (NOT event sourcing). Nevertheless, Postgres will take you very far, with a smaller footprint. Also, Postgres has a large community with a lot of plugins available. Timescale aggregations are very interesting: you can build statistical models on top of your events, which can be very valuable to the business.
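
To make that concrete, here is a minimal sketch of the idea (illustrative column names only, not the exact fstore-sql schema): a single append-only table can serve both event sourcing (replay one stream by version) and event streaming (tail everything by a global offset).

```sql
-- Minimal sketch of an append-only event table in Postgres
-- (illustrative only, not the actual fstore-sql schema).
CREATE TABLE IF NOT EXISTS events (
    global_offset BIGSERIAL   PRIMARY KEY,            -- total order for streaming consumers
    decider_id    TEXT        NOT NULL,               -- aggregate/decider stream identifier
    version       BIGINT      NOT NULL,               -- per-stream sequence for event sourcing
    event_type    TEXT        NOT NULL,
    payload       JSONB       NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (decider_id, version)                      -- optimistic locking per stream
);

-- Event sourcing: rebuild one decider's state.
--   SELECT payload FROM events WHERE decider_id = $1 ORDER BY version;

-- Event streaming: a consumer polls from its last acknowledged offset.
--   SELECT * FROM events WHERE global_offset > $1 ORDER BY global_offset LIMIT 100;
```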

Best, Ivan

cloudcompute commented 6 months ago

Hi @idugalic

It can replace Kafka for the event-streaming part, at least up to a certain scale.

Up to a certain point, I fully agree.

Taken from a Quora answer (source):

General-purpose databases - both relational and the NoSQL kind - are generally geared towards a broad spectrum of storage and retrieval patterns. Naturally, performance is a compromise. Kafka can comfortably handle moving millions of records per second on commodity hardware, with latencies that vary between 10s and 100s of milliseconds. It can achieve this because of certain design decisions, such as using an append-only log, batching reads and writes, batch compression, unflushed buffered writes (avoiding fsync), using zero-copy (I/O that does not involve the CPU and minimises mode switching), minimising garbage collection, and the like. Conversely, Kafka does not offer efficient ways of locating and retrieving records based on their contents - something that databases are designed to do reasonably well. (Which is why Kafka is not regarded as a general-purpose database.)

Kafka has built-in mechanisms for fanning out data and supporting multiple disjoint consumers by way of persistent offsets, which are managed automatically via a construct called a consumer group.
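
For a Postgres-based store to offer the same kind of fan-out, each disjoint consumer's position has to be tracked explicitly. A rough sketch of what that could look like, assuming an events table with a global offset like the one sketched above (hypothetical table and column names, not taken from fstore-sql or message-db):

```sql
-- Hypothetical per-consumer offsets table: the Postgres-side analogue of
-- Kafka's consumer-group offsets (names are illustrative only).
CREATE TABLE IF NOT EXISTS consumer_offsets (
    consumer_group TEXT   PRIMARY KEY,
    last_offset    BIGINT NOT NULL DEFAULT 0
);

-- Each consumer polls events past its own position, processes the batch,
-- and then acknowledges by advancing its offset:
--   SELECT * FROM events
--   WHERE global_offset > (SELECT last_offset FROM consumer_offsets
--                          WHERE consumer_group = 'projector-1')
--   ORDER BY global_offset LIMIT 100;
--
--   UPDATE consumer_offsets SET last_offset = $1
--   WHERE consumer_group = 'projector-1';
```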

If you want to add some more features to this library:

Kindly have a look at message-db, which is written on top of PostgreSQL.

I have a question.

We use databases for both write and read purposes. Writes in a relational DB are very fast because storage is row-oriented, while analytical reads are not. For big-data analytics and machine learning, we use data warehouses / data lakes like Iceberg, where reads are very fast because storage is column-oriented (Parquet format).

A typical flow is something like this:

a. Producer App --> PostgreSQL (event sourcing at the beginning) --> Kafka (using Outbox or Debezium; a sketch follows this list) --> Stream processors (Flink, Spark) --> Consumer apps (Elasticsearch, dashboards, alerts, etc.)

b. And then the data is finally written to Snowflake, Iceberg, etc.
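
For reference, the Outbox/Debezium step in flow (a) boils down to writing the event and an outbox row in one transaction, roughly like this (a hedged sketch; table and column names are illustrative, not Debezium's required schema):

```sql
-- Illustrative outbox table (names are hypothetical).
CREATE TABLE IF NOT EXISTS outbox (
    id           BIGSERIAL   PRIMARY KEY,
    aggregate_id TEXT        NOT NULL,
    event_type   TEXT        NOT NULL,
    payload      JSONB       NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- The application writes the domain event and the outbox row in the same
-- transaction, so Kafka only ever receives events that were durably stored:
--   BEGIN;
--   INSERT INTO events (decider_id, version, event_type, payload) VALUES (...);
--   INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (...);
--   COMMIT;
-- Debezium then tails the WAL (or a poller reads this table) and publishes
-- each outbox row to Kafka.
```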

Can't the flow be like this instead, bypassing PostgreSQL:

  1. Producer App --> Kafka (store events temporarily until operations 2 and 3 are carried out)

  2. Let Flink/Spark do the real-time event stream processing

  3. Pass the results to consumer apps

  4. Write the events to Iceberg (event sourcing at the end) and remove them from Kafka. Kafka may read the historical events back from Iceberg later.

Thank you once again

idugalic commented 6 months ago

Kindly have a look at message-db, which is written on top of PostgreSQL.

Thanks! I will have a look.

A typical flow is something like this:

a. Producer App --> PostgreSQL (event sourcing at the beginning) --> Kafka (using Outbox or Debezium) --> Stream processors (Flink, Spark) --> Consumer apps (Elasticsearch, dashboards, alerts, etc.)

b. And then the data is finally written to Snowflake, Iceberg, etc.

Yes, if your business is already using Postgres to store events and has the need and bandwidth to scale streaming, then Outbox and Debezium with Kafka most probably make sense.

At Fraktalio we wanted to push this further and enable smaller teams to adopt event-driven architecture, slowly introducing additional components into the landscape as needed. We hope to accommodate that with the Postgres SQL model we shared in this repo. Our intentions are good :)

Can't the flow be like this instead, bypassing PostgreSQL:

Producer App --> Kafka (store events temporarily until operations 2 and 3 are carried out)

Let Flink/Spark do the real-time event stream processing

Pass the results to consumer apps

Write the events to Iceberg (event sourcing at the end) and remove them from Kafka. Kafka may read the historical events back from Iceberg later.

I like your first approach better. Kafka truly shines at event streaming, but it sucks at event sourcing.

How would you robustly implement optimistic locking in the case of concurrent writes to Kafka? (Usually, you need to guarantee event ordering per aggregate/decider stream.)

I would always choose a general-purpose ACID-compliant database (relational, graph, or key-value) or a dedicated event store database (Axon Server, EventStoreDB) to store events robustly, rather than Kafka.
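
For illustration, with a relational store that ordering guarantee is just a constraint plus a conditional append; a minimal sketch (assuming an events table with a UNIQUE (decider_id, version) constraint, as sketched earlier in the thread; not the exact fstore-sql statements):

```sql
-- The caller has read the stream up to version 6, so it appends version 7.
-- If a concurrent writer already committed version 7 for this decider,
-- the UNIQUE (decider_id, version) constraint rejects the insert with a
-- unique_violation, and the command is retried against the reloaded state.
INSERT INTO events (decider_id, version, event_type, payload)
VALUES ('order-42', 7, 'OrderShipped', '{"orderId": "42"}');
```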

cloudcompute commented 6 months ago

I would always choose a general-purpose ACID-compliant database (relational, graph, or key-value) or a dedicated event store database (Axon Server, EventStoreDB) to store events robustly, rather than Kafka.

You didn't get my point. Kafka should in no way be used for event sourcing. I meant to discuss when and where the events should be stored.

First approach

Events are sourced (stored) immediately in a relational database and forwarded to Kafka later on.

Second approach

We are not storing the events immediately (bypassing PostgreSQL event storage). A producer application sends the events to Kafka first, where they are stored temporarily for processing etc., and they are sourced (stored) in a column-oriented data lake like Iceberg at the end.

(architecture diagram)

I want to point to an architecture somewhat like this, with the following exception: the data is first stored in the 'Data sources' layer and again in the 'Storage layer', so there is redundancy. We could probably eliminate the first layer if we are building an application from scratch. I am not sure whether it makes much sense, though.

Are you using Kotlin-based event sourcing in production? (this and this)

idugalic commented 6 months ago

I see. I still have a problem understanding this part:

A producer application sends the events to Kafka first, where they are stored temporarily

It is temporary, I get it, but Kafka still plays an important role here. Would you want to validate that the events are ordered correctly at this point? Or do you want to delegate this validation and the optimistic locking to the downstream ACID database? I fail to see how this can be implemented robustly, but that does not mean your assumptions are wrong :)

We are using both in production.

PS. I suggest using Discussions for this kind of conversation.

Best, Ivan

cloudcompute commented 6 months ago

Kafka guarantees ordering within a partition, not across partitions. Therefore, the processes that consume and process those events and write them to the downstream database have to take care of the ordering themselves, e.g. by using timestamps.
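
One possible reading of "ordering using timestamps" on the downstream side is a last-write-wins guard in the projection; a hedged sketch (hypothetical table and column names, just to illustrate the idea):

```sql
-- Hypothetical downstream projection table: the consumer may apply events
-- out of order and relies on the event timestamp to keep only the latest state.
CREATE TABLE IF NOT EXISTS order_projection (
    order_id   TEXT        PRIMARY KEY,
    status     TEXT        NOT NULL,
    event_time TIMESTAMPTZ NOT NULL
);

INSERT INTO order_projection (order_id, status, event_time)
VALUES ('42', 'SHIPPED', '2024-05-01T10:00:00Z')
ON CONFLICT (order_id) DO UPDATE
SET status     = EXCLUDED.status,
    event_time = EXCLUDED.event_time
WHERE order_projection.event_time < EXCLUDED.event_time;  -- ignore stale, out-of-order events
```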

This is just one of the reasons I don't favour using Kafka. Other reasons include that it consumes a lot of resources and is difficult to administer.

I am closing this issue for now. I'll make sure to use Discussions for future conversations.

PS: If you have any development-related work that you want to outsource, it would be great if you could consider me.

Thanks and Regards

idugalic commented 6 months ago

I will consider it. We help teams adopt event sourcing and event-driven design by building open-source libraries and utilities and by providing consultancy. We are inspired by the DDD, event-sourcing, functional-programming, and data-driven communities. Thanks for your questions and for showing your interest. :heart: