antirez / disque-module

Disque ported as Redis module

Multi-DC / eventual / partially-async replication #1

Open vaizki opened 4 years ago

vaizki commented 4 years ago

I implemented a partially-async replication patch for the original Disque project and back then put in a hopeful PR (https://github.com/antirez/disque/pull/200). This was a vital feature for my use case (multiple datacenters) and something that I think has wider appeal and fits the project since Disque is designed as an "AP" system.

Is this type of replication something that could make its way into the Disque module? I am fully aware that the first priority is to reach feature/stability parity with the old project, but I wanted to put this in early for consideration, as it may simmer in some brains and spark ideas on even better ways to handle multi-DC replication, especially in split-brain situations.

P.S. Great to see Disque reborn and really appreciate the wonderful README!

antirez commented 4 years ago

Hello @vaizki, I find this feature very interesting; it could be one of the first additions I make once the original tests all pass and persistence gets implemented. Thanks.

antirez commented 4 years ago

@vaizki please also check https://github.com/antirez/disque-module/issues/5

vaizki commented 4 years ago

Your ideas in #5 look good for synchronous multi-AZ replication, but my patch was trying to solve a different set of problems:

So in essence, I am proposing separating desired replication and write concern in ADDJOB.

I also considered a model where the ADDJOB timeout would tie into this: the client could wait for full replication to happen, and if the timeout is reached with only partial replication achieved, a "semi-success" is returned and the replication is completed asynchronously as soon as possible. But these ideas did not really work with the original project and I needed to move on.
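To make the split concrete, here is a rough client-side sketch in Python, using redis-py's generic execute_command against a node with the Disque module loaded. The WRITECONCERN option is purely hypothetical and only illustrates the separation being proposed; today ADDJOB offers REPLICATE <count> (fully synchronous) and ASYNC (fully fire-and-forget), with nothing in between:

```python
# A minimal sketch, assuming a local node with the Disque module loaded on
# port 7711. The WRITECONCERN option is hypothetical (it is not part of the
# real ADDJOB syntax: ADDJOB queue job ms-timeout [REPLICATE n] [ASYNC] ...);
# it only illustrates separating the desired replication level from the
# number of copies the client waits for.
import redis

client = redis.Redis(host="localhost", port=7711)

def addjob_partially_async(queue, body, desired_copies=5, acked_copies=2,
                           timeout_ms=1000):
    """Ask the node to eventually reach desired_copies replicas, but reply OK
    as soon as acked_copies replicas have acknowledged the job."""
    return client.execute_command(
        "ADDJOB", queue, body, timeout_ms,
        "REPLICATE", desired_copies,    # desired (eventually reached) replication
        "WRITECONCERN", acked_copies,   # hypothetical: copies required before the OK
    )

# For comparison, the two extremes available today:
#   ADDJOB q body 1000 REPLICATE 2         -> ack after 2 copies, never try for more
#   ADDJOB q body 1000 REPLICATE 5 ASYNC   -> ack immediately, reach 5 copies later
```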

antirez commented 4 years ago

@vaizki thanks, I understood the problem, but what is not completely clear to me is how useful a desired replication level that is sometimes not reached really is. It does not offer any true guarantee; it can just improve the survival rate of jobs in case of failures. It may still be interesting, and I didn't mean for my issue to replace the functionality of the one you proposed. #5 is more like a step towards providing more guarantees, while #4 instead attempts to reduce the percentage of jobs that get lost when they are replicated with weaker guarantees. Both have their place I guess, but #4 is more of a "best effort safety" feature.

vaizki commented 4 years ago

At least for me, doing synchronous writes to two AZs with 100ms+ RTT between them is just not an option. It will kill performance and occasionally kill the whole application if the other AZ is not responding. It would be like writing to a Redis key and waiting for a slave (which may or may not be running somewhere far away) to ack the replication. Of course I would prefer three AZs with minimal latency between them, but that is not an available option.

An availability + performance comparison can also be drawn with RAID1 disk mirroring: if one side of the mirror is down, the remaining side still accepts writes just as quickly as before, and as soon as the faulty side recovers it is re-synced and N=2 redundancy is achieved again. So for replication, it is about being able to operate in a degraded state while requesting that the desired replication level eventually be reached.

Because Disque is a message passing system which by nature decouples producers and consumers, there is not as much worry about split-brain or diverging data between cluster members as there is with RAID1, Redis master-slave replication or an ACID RDBMS cluster. Working at the message level, with messages being immutable, guarantees that IF a message exists on this cluster member, then it has the correct content and I can process it.

And that is exactly why I want to use Disque and why I need this tunable "write concern": I need availability and partition tolerance above all else, and to get them I can tolerate, for example, the same job being processed more than once.
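For what it's worth, the consumer side that makes duplicate processing tolerable is fairly small. A minimal sketch, assuming a node on localhost:7711; the already_processed / mark_processed helpers are made up for the example and stand in for application-owned storage, and dedup here keys on the Disque job ID, so duplicates produced under different IDs would need an application-level key inside the body instead:

```python
# A minimal idempotent-consumer sketch: at-least-once delivery is fine as long
# as processing the same job twice is either harmless or filtered out here.
import redis

client = redis.Redis(host="localhost", port=7711)
processed_ids = set()  # stand-in for durable, application-owned dedup storage

def already_processed(job_id):
    return job_id in processed_ids

def mark_processed(job_id):
    processed_ids.add(job_id)

def handle(body):
    print("processing", body)  # application logic goes here

def consume(queue="myqueue"):
    while True:
        # GETJOB returns a list of [queue, job_id, body] triples, or nil on timeout.
        jobs = client.execute_command("GETJOB", "TIMEOUT", 1000, "FROM", queue)
        if not jobs:
            continue
        for _queue, job_id, body in jobs:
            if not already_processed(job_id):
                handle(body)
                mark_processed(job_id)
            client.execute_command("ACKJOB", job_id)  # acking a duplicate is harmless
```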

antirez commented 4 years ago

Indeed it's not a matter of consistency; the replication factor in Disque tunes the ability to survive failures. In that regard, saying "give me the green light when the message has been replicated 2 times, but actually try to replicate it 5 times" can be seen as trading a minimum durability guarantee against optimal durability and latency concerns. So indeed it makes sense, but the messages are still only guaranteed to survive a single node failure. However, with this option, if there is a two-node failure the percentage of messages that get lost will be minor.
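As a rough, back-of-the-envelope way to see that trade-off, here is a small combinatorial sketch. The 10-node cluster size is made up for illustration, and it assumes the failing nodes are picked uniformly at random, independently of where a given job happens to be replicated:

```python
# Sketch: probability that simultaneous random node failures wipe out every
# copy of a job, for jobs that currently exist on `copies` nodes. With an ack
# at 2 copies but a desired replication of 5, only the jobs whose background
# replication had not yet progressed past 2 copies are exposed to the 2-failure
# case.
from math import comb

def loss_probability(nodes, copies, failures):
    """P(all `copies` replicas are among the `failures` failed nodes)."""
    if failures < copies:
        return 0.0
    return comb(nodes - copies, failures - copies) / comb(nodes, failures)

N = 10  # assumed cluster size for illustration
for copies in (2, 5):
    for failures in (1, 2):
        p = loss_probability(N, copies, failures)
        print(f"copies={copies} failures={failures} -> P(job lost) = {p:.3%}")

# Output for N=10:
#   copies=2 failures=1 -> 0.000%
#   copies=2 failures=2 -> 2.222%   (only jobs still at 2 copies are exposed)
#   copies=5 failures=1 -> 0.000%
#   copies=5 failures=2 -> 0.000%
```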