FgForrest / evitaDB

evitaDB is a specialized database with an easy-to-use API for e-commerce systems. It is a low-latency NoSQL in-memory engine that handles all the complex tasks that e-commerce systems have to deal with on a daily basis. evitaDB is expected to act as a fast secondary lookup/search index used by front stores.
https://evitadb.io

Change Data Capture support #187

Open novoj opened 1 year ago

novoj commented 1 year ago

In order to keep remote clients in sync, we need to introduce some kind of change data capture (i.e. the ability to stream information about changes on evitaDB servers to all interested clients). The same principle will one day be used to synchronize changes between the master and its replicas.

Everything starts with a ...

Client request to watch changes:

The client will have to explicitly state which operations it wants to monitor and in what scope. The request must be made at the beginning of the monitoring stream, but can be extended at any time. A single client can make multiple requests for the same stream of changes. The request has the following format:
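A rough sketch of what such a request could look like, assuming a simple Java record; all names here (ChangeCaptureRequest, CaptureArea, CaptureOperation, sinceTransactionId) are hypothetical placeholders rather than the final evitaDB API:

```java
import java.util.Set;

// Hypothetical shape of a change-capture request; the names below are
// illustrative only and do not reflect the final evitaDB API.
public record ChangeCaptureRequest(
	CaptureArea area,                 // which scope to monitor (schema / data / system)
	String entityType,                // optional narrowing to a single collection, null = all
	Set<CaptureOperation> operations, // which operations to observe
	Long sinceTransactionId           // optional "since" part: last transaction the client knows about
) {

	public enum CaptureArea { SCHEMA, DATA, SYSTEM }

	public enum CaptureOperation { UPSERT, REMOVE }

}
```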

The request is confirmed by the server and assigned a unique ID. The client can cancel the request at any time using the assigned ID, or cancel all of its requests at once. If the connection to the server is lost, all requests are immediately dropped and the client must register them again when the connection is re-established.
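A minimal sketch of that registration lifecycle, reusing the hypothetical ChangeCaptureRequest record from the previous sketch; the service and method names are again illustrative only, not the actual evitaDB interface:

```java
import java.util.UUID;
import java.util.concurrent.Flow;

// Hypothetical server-side contract for the lifecycle described above.
public interface ChangeCaptureService {

	// Placeholder for a single emitted change event (transaction, operation, payload).
	record ChangeCapture(long transactionId, String operation, Object body) {}

	// Registers a new capture request on an already opened stream and returns the
	// unique ID assigned by the server; the ID is later used for cancellation.
	UUID registerCapture(ChangeCaptureRequest request, Flow.Subscriber<ChangeCapture> subscriber);

	// Cancels a single previously registered request.
	void cancelCapture(UUID captureId);

	// Cancels all requests registered by this client; the same happens implicitly
	// when the connection to the server is lost.
	void cancelAllCaptures();

}
```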

In order for the client to be able to follow up on previous communication, it can use the since part, where it specifies the last version it knows about, and the server will immediately send all changes since then. If the history reaches too far back and the server no longer has the information (the WAL has been purged), an exception will be thrown and the client will have to rebuild its data from scratch.
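Illustrative client-side handling of the since part: on reconnect the client asks for a replay from its last known transaction and falls back to a full rebuild when the history is gone. ChangeCaptureService and ChangeCaptureRequest are the placeholder types from the sketches above, and WalPurgedException is an assumed name, not the real API:

```java
import java.util.UUID;
import java.util.concurrent.Flow;

// Sketch of a client that re-subscribes after a lost connection.
public class ReconnectingCaptureClient {

	private final ChangeCaptureService service;
	private volatile long lastSeenTransactionId;

	public ReconnectingCaptureClient(ChangeCaptureService service) {
		this.service = service;
	}

	public UUID resubscribe(Flow.Subscriber<ChangeCaptureService.ChangeCapture> subscriber) {
		try {
			// ask the server to replay everything after the last known transaction
			return service.registerCapture(
				new ChangeCaptureRequest(
					ChangeCaptureRequest.CaptureArea.DATA, null, null, lastSeenTransactionId
				),
				subscriber
			);
		} catch (WalPurgedException e) {
			// the WAL no longer contains the requested history - the client has to
			// throw its local state away and rebuild it from scratch
			rebuildFromScratch();
			return service.registerCapture(
				new ChangeCaptureRequest(ChangeCaptureRequest.CaptureArea.DATA, null, null, null),
				subscriber
			);
		}
	}

	// remember the newest transaction id observed in the stream (or in a heartbeat)
	public void onTransactionSeen(long transactionId) {
		this.lastSeenTransactionId = Math.max(this.lastSeenTransactionId, transactionId);
	}

	private void rebuildFromScratch() {
		// re-fetch all data the client mirrors via regular queries
	}

	// hypothetical exception signalling that the requested history was purged
	public static class WalPurgedException extends RuntimeException {}

}
```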

Server sends changes that match the filter:

The server maintains a list of filters per open stream to the client and sends all changes (or just the information about their existence) that match the filter defined by a scope/site combination to the particular client. If the since part is provided and the version doesn't match the current version, the server must scan the WAL to find the particular transaction_id and replay all changes since that moment for the particular client. If the oldest transaction_id retained in the WAL is greater than the requested one, an error is sent and the client must rebuild all its data from scratch.
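A rough server-side sketch of that replay logic, again built on the placeholder types from the sketches above; the WriteAheadLog accessor is a made-up abstraction, not the actual evitaDB WAL interface:

```java
import java.util.List;
import java.util.concurrent.Flow;
import java.util.stream.Stream;

// Sketch of the per-stream handler that replays historical changes from the WAL.
public class CaptureStreamHandler {

	interface WriteAheadLog {
		// oldest transaction id still retained in the WAL (older ones were purged)
		long oldestRetainedTransactionId();
		// stream of captures recorded after the given transaction id
		Stream<ChangeCaptureService.ChangeCapture> capturesSince(long transactionId);
	}

	private final WriteAheadLog wal;
	// filters registered on this particular open stream
	private final List<ChangeCaptureRequest> filters;
	private final Flow.Subscriber<ChangeCaptureService.ChangeCapture> client;

	CaptureStreamHandler(WriteAheadLog wal,
	                     List<ChangeCaptureRequest> filters,
	                     Flow.Subscriber<ChangeCaptureService.ChangeCapture> client) {
		this.wal = wal;
		this.filters = filters;
		this.client = client;
	}

	// Replays historical changes when the request carried a "since" transaction id.
	void replaySince(long sinceTransactionId) {
		if (wal.oldestRetainedTransactionId() > sinceTransactionId) {
			// requested history is no longer available - client must rebuild from scratch
			client.onError(new IllegalStateException("Requested history was already purged from the WAL."));
			return;
		}
		wal.capturesSince(sinceTransactionId)
			.filter(this::matchesAnyFilter)
			.forEach(client::onNext);
	}

	private boolean matchesAnyFilter(ChangeCaptureService.ChangeCapture capture) {
		// a real implementation would compare scope, entity type and operation
		return !filters.isEmpty();
	}

}
```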

Maintenance events:

The server will periodically send the last observed transaction_id to all clients, even if no monitored change has occurred. This limits the amount of data that has to be scanned in case the connection is lost and has to be re-established. The client must update its internal last-seen transaction_id accordingly.
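A minimal sketch of such a heartbeat, assuming a simple scheduled task on the server side; the 30-second interval and all names are arbitrary assumptions:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Periodically pushes the last observed transaction id even when no monitored
// change happened, so a reconnecting client has a recent "since" value at hand.
public class TransactionHeartbeat {

	private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

	// `lastObservedTransactionId` supplies the server-side counter,
	// `clientSink` represents pushing the value to a connected client,
	// which then updates its internal last-seen transaction id
	public void start(LongSupplier lastObservedTransactionId, LongConsumer clientSink) {
		scheduler.scheduleAtFixedRate(
			() -> clientSink.accept(lastObservedTransactionId.getAsLong()),
			30, 30, TimeUnit.SECONDS
		);
	}

	public void stop() {
		scheduler.shutdownNow();
	}

}
```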

novoj commented 1 year ago

@lukashornych @Khertys we need to analyze the current technical possibilities of the underlying protocols, i.e. gRPC streams, GraphQL WebSockets and REST. Besides WebSockets, there is also the older SSE specification. We need to consider all the options.

lukashornych commented 1 year ago

Summary for GQL/REST after some discussion:

Things to analyze more:

novoj commented 1 year ago

As we discussed, I've pushed changes to the CDC mechanism. The new approach is prepared only for the SYSTEM part of the CDC. Once our prototype works, I'll extend it to the data capture part of the API. What has changed:

@khertys look for the TODO TPO markers for the key changes on your side.

lukashornych commented 12 months ago

I've managed to implement GraphQL subscriptions. System captures work fully. Catalog data captures are invoked properly, but since evitaDB support for them is missing, they couldn't be tested yet. Catalog schema captures have some kind of problem where the GQL layer is not invoked properly; I will need to look into it.

Otherwise, this still needs to be done by me: