camunda / camunda

Process Orchestration Framework
https://camunda.com/platform/

Separate storage engine from zeebe #5236

Closed ysaakpr closed 2 years ago

ysaakpr commented 4 years ago

We have been using Zeebe in production for the last couple of months and want to get it to a more stable state. We have seen many issues with the way it handles data; at times we lose visibility into Zeebe's internal data because of how the exporter works, and how it blocks and impacts the performance of the entire broker. Going through the issues currently tracked on Zeebe, I can see that the majority of them are related to the storage engine, not really to managing workflows. This effectively means we are reinventing the wheel of storing and retrieving data efficiently.

I was exploring a few articles and found the architecture below from Uber Cadence, which is another workflow system (architecture overview image).

I wonder why Zeebe can't have a similar architecture. I agree that it goes against keeping the components needed to run a workflow engine minimal. But with the growing availability of managed and serverless services (e.g., Cassandra on AWS, GCP, and Azure), the whole headache of storing data can be handed off to other solutions, where hundreds of developers are constantly working just to improve and enhance the storage features.

This also gives flexibility. Today, Zeebe does not give the ability to query data directly; we have to export the data using an exporter, and the exporter adds additional load on the system. Many times, exporting causes a delay in knowing the real workflow status at any given point in time. And maybe that is the reason the workflow query portion was removed from the Zeebe server.

Can we rethink this and consider whether it really makes sense to put a lot of development effort into reinventing the storage system, rather than giving the choice to end users by writing storage adapters for the efficient query requirements that suit them? For example, the increasing popularity of serverless storage could literally bring down the cost of running a Zeebe cluster.

Or, can we address the following:

  1. Real-time status of a workflow: what state it is currently in.
  2. Can the exporter run as an independent pod, so that it does not affect CPU and memory on the broker and real HA of the broker can be achieved?
  3. The ability to query actively running workflows for application logic where delays and inconsistencies matter.

Making the primary storage layer adapter-based opens up many possibilities to evolve the query side of the system. Storing data without an effective way to query it will be troublesome at times.

Here are similar issues that could be addressed by this feature:

  * https://github.com/zeebe-io/zeebe/issues/5089
  * https://github.com/zeebe-io/zeebe/issues/4332
  * https://github.com/zeebe-io/zeebe/issues/4274
  * https://github.com/zeebe-io/zeebe/issues/1980

saig0 commented 4 years ago

@ysaakpr thank you for raising this up :+1: ...and sorry for the delay :see_no_evil:

Let's try to clarify some points, so we have a better understanding of how things work and can focus on the concrete problems that you want to solve.

> We have seen many issues with the way it handles data; at times we lose visibility into Zeebe's internal data because of how the exporter works, and how it blocks and impacts the performance of the entire broker.

Do you have concrete issues with the way how Zeebe handles data? Please list them.

Which information do you miss from the exporter API?

The exporters don't have a direct effect on workflow processing performance. They run in a different thread pool that can be configured separately. However, since they run in the same JVM as the broker, they use the same resources. So they should be optimized and should not do too much processing.

Exporters do have an impact on how fast data can be removed from the broker: a record can only be removed, when the next snapshot is created, once it has been seen by all exporters.
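That compaction rule can be sketched in a few lines. This is a simplified model with made-up names, not the actual broker code: the broker can only delete records up to the lowest position acknowledged by every exporter, further bounded by the position the last snapshot covers.

```java
import java.util.Map;

// Simplified sketch (hypothetical names, not broker internals): the log can
// only be compacted up to the lowest record position that every exporter has
// acknowledged, bounded by the position covered by the last snapshot.
public class CompactionSketch {

    // exporterPositions: last acknowledged record position per exporter
    static long safeToDeleteUpTo(Map<String, Long> exporterPositions,
                                 long lastSnapshotPosition) {
        long minExported = exporterPositions.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MAX_VALUE); // no exporters: the snapshot alone decides
        return Math.min(minExported, lastSnapshotPosition);
    }

    public static void main(String[] args) {
        // One exporter lags behind at position 100, another is at 250,
        // and the snapshot covers position 200.
        long limit = safeToDeleteUpTo(
                Map.of("elasticsearch", 100L, "metrics", 250L), 200L);
        System.out.println(limit); // prints 100: the slowest exporter holds back compaction
    }
}
```

This is why a slow or stuck exporter causes disk usage to grow: it pins the compaction limit regardless of how far processing has advanced.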

> I wonder why Zeebe can't have a similar architecture.

First, let's explain Zeebe's "storage engine". Zeebe is a workflow engine built on stream processing. All records (events and commands) are written to a stream and processed in order.

It can have multiple partitions to scale, each with its own stream and processors. And each partition can be replicated to a given number of nodes to be fault-tolerant.

For the stream processing (e.g. storing the records, multiple partitions, replicate partitions), Zeebe uses (a fork of) the Atomix framework.

Additionally, Zeebe has a storage layer to materialize the current state of the stream. This storage is built from the records of the stream and is read when a record is processed, to know the latest state of the workflow. For this storage, Zeebe uses RocksDB, an embedded key-value store.

In order to remove data and improve the startup time, Zeebe creates snapshots. A snapshot is a copy of the current storage. When the broker starts, it can read the snapshot and does not need to replay all previous records.
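The stream-plus-snapshot pattern described above can be illustrated with a toy example. All names here are made up for illustration; the real broker works on binary records and RocksDB, not string maps:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration (hypothetical names, not broker code) of the pattern
// described above: state is materialized by applying stream records in
// order, and a snapshot lets a restart skip already-applied records.
public class StreamStateSketch {

    record LogRecord(long position, String workflowInstance, String newState) {}

    // Materialized view: workflow instance -> current state
    // (stands in for the RocksDB storage).
    static Map<String, String> replay(List<LogRecord> log,
                                      Map<String, String> snapshot,
                                      long snapshotPosition) {
        Map<String, String> state = new HashMap<>(snapshot);
        for (LogRecord r : log) {
            if (r.position() > snapshotPosition) { // records up to the snapshot are skipped
                state.put(r.workflowInstance(), r.newState());
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<LogRecord> log = List.of(
                new LogRecord(1, "order-1", "CREATED"),
                new LogRecord(2, "order-1", "ACTIVATED"),
                new LogRecord(3, "order-2", "CREATED"));
        // A snapshot taken after position 2 already contains order-1's state.
        Map<String, String> snapshot = Map.of("order-1", "ACTIVATED");
        Map<String, String> state = replay(log, snapshot, 2);
        System.out.println(state.get("order-2")); // prints "CREATED": only record 3 was re-applied
    }
}
```

Note that the state is always a function of a reading position on the stream, which becomes relevant again below when discussing what an "actively running workflow" means.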

You may ask why Zeebe doesn't use a classic (relational) database to store all data?

In our experience with Camunda BPM, we have seen that a central database is often the performance bottleneck. So we decided to build the workflow engine on top of stream processing.

> This also gives flexibility. Today, Zeebe does not give the ability to query data directly; we have to export the data using an exporter, and the exporter adds additional load on the system. Many times, exporting causes a delay in knowing the real workflow status at any given point in time. And maybe that is the reason the workflow query portion was removed from the Zeebe server.

Zeebe doesn't provide a query API for different reasons. Two major reasons are:

So, we built the exporter API, which allows you to build your own query API on top. That API can be optimized for your use case and can store the data in a way that fits your queries. Since it is provided by a separate application, it does not affect broker performance and can scale as needed.

But exporters are not designed for real-time updates (with a very small delay). The exporters may always be a bit behind the actual workflow processing, depending on the load, the resources, and the enabled exporters.

If you need real-time updates for your business process, then you may want to include this aspect in the workflow. For example, if a system should be informed of a certain state or react to it, you should model it directly in the workflow as a service task.

> Can we rethink this and consider whether it really makes sense to put a lot of development effort into reinventing the storage system, rather than giving the choice to end users by writing storage adapters for the efficient query requirements that suit them? For example, the increasing popularity of serverless storage could literally bring down the cost of running a Zeebe cluster.

Currently, the stream processing and the database storage are major parts of Zeebe. We have internal abstractions for these parts to avoid hard bindings to a concrete framework like RocksDB, but it is not easy to expose them to users.

On the technical side, for example, Zeebe stores its data in RocksDB in a binary format for performance reasons. Switching the database may not give you more transparency or control, because you cannot read the data easily.

On the operational side, it is hard to predict how Zeebe works and performs using a different stream processing framework or database. We may not have the resources to test every possible combination. And it could make it hard to help people if they're using different core technologies.

> Real-time status of a workflow: what state it is currently in.

I recommend including the aspect of real-time updates in the workflow itself. This keeps that part transparent and fits the current architecture.
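As a sketch of that recommendation: a service task is placed at the exact point in the process where the status matters, and a job worker subscribed to its type pushes the status to the interested system instead of that system querying the broker. The task id, name, and job type below are made-up example values.

```xml
<!-- Fragment of a BPMN process (id, name, and job type are example values;
     the bpmn: and zeebe: prefixes refer to the standard BPMN 2.0 and Zeebe
     extension namespaces). -->
<bpmn:serviceTask id="Task_NotifyStatus" name="Notify status system">
  <bpmn:extensionElements>
    <!-- A job worker subscribed to this type actively receives the status
         instead of polling or querying the broker. -->
    <zeebe:taskDefinition type="status-notifier" />
  </bpmn:extensionElements>
</bpmn:serviceTask>
```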

If this doesn't work for you then please describe your use case.

> Can the exporter run as an independent pod, so that it does not affect CPU and memory on the broker and real HA of the broker can be achieved?

That is an interesting idea. We have also thought about it and have ideas for externalizing the exporter part, but we haven't had time yet and it is not our highest priority.

> The ability to query actively running workflows for application logic where delays and inconsistencies matter.

As with the real-time updates, I recommend including this aspect in the workflow itself.

The term "actively running workflow" is interesting here, because the actual state of a workflow depends on time. Since we have a stream of records, the state depends on the current reading position within the stream.

npepinpe commented 2 years ago

Closing due to lack of activity in a year. Feel free to reopen if you'd like to keep going, but for discussions the forums are possibly a better option.