Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

[BUG] Active-Active state transfer storms #333

Open mleklund opened 3 years ago

mleklund commented 3 years ago

Describe the bug

Certain common circumstances can cause a storm of state transfers.

To reproduce

In an active-active cluster, doing a rolling restart of the nodes for something like configuration changes or version updates can cause a storm of complete state transfers and "Active Replica - LOADING Redis is loading the dataset in memory on full sync" errors.

Expected behavior

The purpose of running something like active-active is to give yourself write high availability, especially across geographical regions, but because of the way the code is currently implemented you get read but not write high availability.

Additional information

From the active replication wiki:

On boot each KeyDB instance will compute a dynamic UUID. This UUID is not saved and exists only for the life of the process. When a replica connects to its master it will inform the master of its UUID and the master will reply with its own. The UUIDs are compared to inform the server if two clients are from the same KeyDB instance (IPs and ports are insufficient as there may be different processes on the same machine). The UUIDs are used to prevent rebroadcasting changes to the master that sent them if it is also our replica.

So, when a node disconnects to restart, it comes back with a new UUID and any repl-backlog for that server is dropped on the floor. This requires a complete state transfer from every running node to the restarted node, and then a reciprocal state transfer from the restarted node to the running nodes. The replicas then need more state transfers of their own, increasing read latency and write unavailability. It's possible that allow-write-during-load fixes this, but it comes with ominous warnings in the keydb.conf file.
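
For reference, a minimal keydb.conf sketch of the kind of two-node active-active setup this describes (hostnames and ports are placeholders); the last directive is the one with the warnings:

```
# keydb.conf on node-a (node-b mirrors this, pointing back at node-a)
active-replica yes        # accept writes while also acting as a replica
replicaof node-b 6379     # placeholder host/port for the peer node

# The setting mentioned above; it ships commented out in keydb.conf
# alongside the warnings, so we have left it disabled so far.
# allow-write-during-load yes
```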

mleklund commented 3 years ago

There is talk of partial resynchronization, but it does not seem to be happening for active-active; it may only apply to active-passive.

BobEQAlpha commented 3 years ago

Hi Mleklund

Thank you for contacting EQAlpha. We appreciate you reaching out to us. To help us address your specific issue better, could you answer the following?

In a worst-case scenario, assuming you're using a mesh network topology of N master nodes, there will be up to N dump.rdb file transfers for each master node. This can lead to up to N*N file transfers for the entire network, which is the transfer storm you are seeing.

1) What is your current topology (ring versus mesh) and number of master nodes?

If you use a unidirectional ring network topology for the initial synchronization, that can at least smooth out the storm a bit. You can then re-add the additional connections later on.
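
To make that concrete, a rough sketch of how the replicaof wiring differs between a full mesh and a unidirectional ring for three masters (hostnames and ports below are placeholders):

```
# Full mesh: every master replicates from every other master
# (requires active-replica yes and multi-master yes on each node).
#   node-a: replicaof node-b 6379
#           replicaof node-c 6379
#   node-b: replicaof node-a 6379
#           replicaof node-c 6379
#   node-c: replicaof node-a 6379
#           replicaof node-b 6379

# Unidirectional ring for the initial synchronization: each master pulls
# from exactly one peer, so a restarted node triggers fewer simultaneous
# full syncs. Re-add the extra replicaof lines once everyone is caught up.
#   node-a: replicaof node-c 6379
#   node-b: replicaof node-a 6379
#   node-c: replicaof node-b 6379
```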

JarmoSa commented 3 years ago

We are also having the same issue, and this full reload can also cause the nodes to run out of memory. We have 4GB of data in KeyDB and each of our three multi-master nodes has 10GB of RAM; that is not enough if we do a rolling restart and the KeyDB RDB file is kept on each node. The only way to sort this out currently is to delete the RDB file on some of the nodes before KeyDB is started.

I think this is related to https://github.com/EQ-Alpha/KeyDB/issues/353, i.e. allow a static master ID for each node. That would keep the nodes from reloading the full 4GB and running out of memory.

mikeries commented 3 years ago

We've encountered this also. We have about the simplest configuration possible: 3 nodes which are all replicas of each other, with active-active and multi-master enabled. Each node has 16GB of RAM, and our development system currently holds about 500MB of data. If I have to take a node down to adjust the configuration, we end up bouncing data around endlessly after restarting it. I've increased the client-output-buffer-limit replica limits to 2gb 1gb 60, but that only seems to let the nodes get close to running out of RAM before giving up.
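
For reference, the keydb.conf line I changed (hard limit, soft limit, soft-limit seconds for replica clients):

```
# Raised from the stock replica limits (256mb 64mb 60, if I remember right).
client-output-buffer-limit replica 2gb 1gb 60
```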

@BobEQAlpha when you say N*N transfers, do you mean that this process will actually stop after 9 transfers? I'm not sure I've ever waited that long...

mleklund commented 3 years ago

That is my experience @mikeries. The performance just was not acceptable for our use cases. I hope they get it figured out, because outside of this issue it seems to be an outstanding product.

mikeries commented 3 years ago

Maybe I'll try the ring configuration, and if that still doesn't work I'll start dropping the features that made me pick this over Redis... multi-master and/or active-active. I really don't want to deal with Sentinel, but this needs to work.

BobEQAlpha commented 3 years ago

@mikeries I believe it is actually 6 transfers rather than 9.

Let's say you have 3 nodes A, B, C.

A will transfer data to B and C, B will transfer data to A and C, and C will transfer data to A and B. Each of the 3 nodes sends to the other 2, so the total is 3*2 = 6 transfers rather than 3*3 = 9.

mikeries commented 3 years ago

Thanks for the clarification, but I'm not having any luck. I have tried a unidirectional ring configuration and turned off multi-master, but if I shut down a node and restart it, the full database gets passed around and seems to grow larger and larger until one of the nodes runs out of RAM. If I'm understanding the log output correctly, a 1GB rdb file somehow grew to more than 11GB in a few minutes.

BobEQAlpha commented 3 years ago

When a node sends its data to another node, it requires a replication stream that uses memory, which is likely responsible for your out-of-memory issue.

Do your master nodes have any attached read-only replicas? Both maintaining a master-replica connection and propagating writes to a read-only replica require some memory.

at0rvk commented 2 years ago

I also had a two-node setup where I hit this issue; the only way to get everything healthy again was to remove the RDB on one host and set repl-timeout 300 to prevent a master timeout while the other host was recovering.
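
Roughly what the recovery looked like on my side, in case it helps anyone (the dump path depends on your dir/dbfilename settings):

```
# In keydb.conf on both hosts before restarting:
repl-timeout 300   # allow up to 5 minutes for the full sync before timing out

# On the host being rebuilt, remove the stale dump before starting KeyDB,
# e.g. rm /var/lib/keydb/dump.rdb (adjust to your dir/dbfilename).
```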

msotheeswaran-sc commented 1 year ago

Is anyone still experiencing this issue? It would be helpful to know in order to understand how to prioritize this.

mikeries commented 1 year ago

I gave up and am using the sentinels now.

mleklund commented 1 year ago

I gave up as well.