Open ieQu1 opened 1 year ago
I assume one needed step is to update the replicants periodically. Maybe, each time the delta set tables rotate, the cores could broadcast the new logical timestamp to the replicants?
Also, I guess clock skew would come into play here: although the timestamps would be just counters, each core node could diverge on what the current timestamp is. That is, each core might be at a different "time". So replicants might need to keep track of the current timestamp per core rather than just per table.
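The per-core bookkeeping suggested above could be sketched like this (a Python toy model; the class and function names are illustrative, not Mria's API):

```python
# Hypothetical sketch: a replicant keeping one logical timestamp per core
# node, since each core rotates its delta set tables independently and may
# therefore be at a different "time".

class ReplicantClock:
    def __init__(self):
        # core node name -> last logical timestamp broadcast by that core
        self.core_timestamps = {}

    def on_rotation_broadcast(self, core, timestamp):
        # Cores broadcast a new logical timestamp each time their delta set
        # tables rotate; the clock only ever moves forward.
        current = self.core_timestamps.get(core, -1)
        if timestamp > current:
            self.core_timestamps[core] = timestamp

    def timestamp_of(self, core):
        return self.core_timestamps.get(core)

clock = ReplicantClock()
clock.on_rotation_broadcast("core1@host", 7)
clock.on_rotation_broadcast("core2@host", 5)
clock.on_rotation_broadcast("core1@host", 6)  # stale broadcast, ignored
```

Tracking per core (rather than per table) keeps the map small and tolerates cores rotating at different rates.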
Currently replicants have to copy the entire contents of the tables when they reconnect, even after a short downtime. With a large enough volume of data this may hinder cluster recovery after a disaster or maintenance. Moreover, it makes it almost impossible to rebalance the load on the core nodes.
Initially we tried to solve this problem by persisting the transaction log, so the replicants recovering after a reconnect could replay it instead of going through the entire bootstrap procedure.
That approach proved to hurt performance too much to be practical. In addition, flapping client connections can often generate
`delete -> add -> delete -> ...`
loops in the transaction log, making it larger than the table itself and making the whole idea of replaying the transaction log questionable.

Below I describe an alternative approach: instead of trying to avoid bootstrap, we could speed it up.
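A toy illustration of why replaying a persisted log can be worse than copying the table (plain Python dict/list stand-ins, nothing Mria-specific):

```python
# A single flapping client produces a delete -> add -> delete -> ... loop,
# so the log grows without bound while the table holds at most one record.

log = []
table = {}

def apply_op(op, key, value=None):
    log.append((op, key, value))  # persisted transaction log keeps every op
    if op == "add":
        table[key] = value
    elif op == "delete":
        table.pop(key, None)

# One client reconnecting 100 times:
for attempt in range(100):
    apply_op("add", "session:client1", {"attempt": attempt})
    apply_op("delete", "session:client1")

print(len(log))    # 200 log entries to replay
print(len(table))  # 0 records actually left in the table
```

Replaying this log does 200 operations to reproduce an empty table; copying the table does none.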
The idea is to maintain rotating `set`-like tables (plain ets or plain rocksdb) that store the following records:

`{{Table, Key}, X}`

where `X` is the value of a counter. Rotating to the next delta set table increments `X` by 1. The `mria_rlog_server` process, as it processes intercepted transactions, writes each affected key to the current delta set table. Existing keys are simply overwritten.
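The bookkeeping described above could look roughly like this (a hedged Python model; a dict stands in for the ets/rocksdb delta set tables, and the names are illustrative):

```python
# Model of the delta set bookkeeping: every intercepted transaction records
# (Table, Key) -> X, where X is the current counter, and rotation bumps X.

class DeltaSet:
    def __init__(self):
        self.x = 0          # logical timestamp / rotation counter
        self.deltas = {}    # (table, key) -> X at which the key last changed

    def on_transaction(self, table, key):
        # A transaction touched (table, key): overwrite the entry with the
        # current counter value, so only the latest X per key is kept.
        self.deltas[(table, key)] = self.x

    def rotate(self):
        # Jumping to the next delta set table increments X by 1.
        self.x += 1

ds = DeltaSet()
ds.on_transaction("routes", "topic/a")
ds.rotate()
ds.on_transaction("routes", "topic/a")  # overwritten with the new X
ds.on_transaction("routes", "topic/b")
```

Because keys are overwritten rather than appended, the delta set stays bounded by the number of distinct keys, unlike the transaction log.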
On reconnect, the bootstrap procedure skips the `clear_table` command (https://github.com/emqx/mria/blob/main/src/mria_bootstrapper.erl#L240), so the replicant preserves its local data.

Pitfalls:
The `X` counter doesn't change during bootstrap; it only increments by 1 when jumping to the next table.
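Under these assumptions, the incremental bootstrap itself reduces to a counter comparison. A sketch (assumed semantics, not Mria's actual protocol): the replicant reports the last counter value it is caught up to, and the core resends only keys whose recorded counter is newer.

```python
# Compute the keys a core must resend to a reconnecting replicant, given
# the delta set entries {(table, key): X} and the replicant's last known
# counter value x_last. All names here are illustrative.

def keys_to_resend(deltas, x_last):
    """Return the (table, key) pairs that changed after x_last."""
    return sorted(k for k, x in deltas.items() if x > x_last)

deltas = {
    ("routes", "topic/a"): 3,
    ("routes", "topic/b"): 5,
    ("routes", "topic/c"): 6,
}
print(keys_to_resend(deltas, 4))
# [('routes', 'topic/b'), ('routes', 'topic/c')]
```

A replicant that was never connected (or whose `x_last` is older than the oldest retained delta set table) would still fall back to the full bootstrap.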