Inter DC partitioning can disrupt replication

Partitioning a cluster of data centers running AntidoteDB can cause acknowledged (:ok) g-set adds to not be fully replicated or, in some cases, to appear on other nodes and then be absent from the final read.

Details of the Jepsen test: https://github.com/nurturenature/fuzz_dist/blob/main/doc/antidotedb.md

Jepsen environment configured for AntidoteDB: https://github.com/nurturenature/jepsen-docker-workaround

Test commands:

# multiple dcs with no faults Ok
lein run test --topology dcs --workload g-set --nemesis none

# intra dc partitioning  Ok
lein run test --topology nodes --workload g-set --nemesis partition

# inter dc partitioning fails
lein run test --topology dcs --workload g-set --nemesis partition

# property-driven tests don't fail on every run, so the test can be run multiple times
lein run test --topology dcs --workload g-set --nemesis partition --test-count 5

The best way to initially interact with the test results is through the web server as described in jepsen-docker-workaround.

Here's a sample workflow tracing an anomaly:

click on invalid test from summary screen
click on results.edn
see that 81 elements are missing from the final reads and pick one, e.g. 136
open history.txt, scroll to the bottom, and see that 136 is only present on the original node

false-results-history
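
The check traced by hand above can be sketched in a few lines. This is only an illustration, not the test's actual checker: the history shape (dicts with `type`, `f`, `value`, `node`, and a `final` flag on final reads) and the function name `missing_from_final_reads` are assumptions made for the example, and the sample values are made up.

```python
def missing_from_final_reads(history):
    """Return {node: sorted :ok adds that are absent from that node's final read}."""
    # Every add that was acknowledged with :ok must show up in every final read.
    acked = {op["value"] for op in history
             if op["f"] == "add" and op["type"] == "ok"}
    missing = {}
    for op in history:
        if op["f"] == "read" and op["type"] == "ok" and op.get("final"):
            lost = acked - set(op["value"])
            if lost:
                missing[op["node"]] = sorted(lost)
    return missing


# Hypothetical history: the add of 136 is acknowledged on n1 but never reaches n2.
history = [
    {"type": "ok", "f": "add", "value": 135, "node": "n1"},
    {"type": "ok", "f": "add", "value": 136, "node": "n1"},
    {"type": "fail", "f": "add", "value": 137, "node": "n2"},
    {"type": "ok", "f": "read", "value": [135, 136], "node": "n1", "final": True},
    {"type": "ok", "f": "read", "value": [135], "node": "n2", "final": True},
]
print(missing_from_final_reads(history))  # {'n2': [136]}
```

Failed adds (137 here) are deliberately ignored, since only :ok adds are required to be present in the final reads.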

Now let's look at an AntidoteDB log file for a node:

from the test summary screen
click on a node name to see all log files from that node
click on the AntidoteDB log of interest
scroll to bottom to observe message loss recovery caused by partitioning

test-node-antidote

The timeline.html can also be used:

see :ok add for value 136 by worker 4
see it was replicated in read by worker 3 a few transactions later:

timeline-showing-repl

But missing from final read by worker 3:

timeline-missing-in-final-read

Please ask if there are any questions or desired changes to the test, environment, etc.

P.S. a good way to get a representative feel for what happens during inter dc partitioning:

# run test multiple times regardless of valid? true/false 
lein run test-all --topology dcs --workload g-set --nemesis partition --test-count 10

Most runs will be invalid. Take a quick look at the test summary pages: latency-raw.png shows partition timing/duration and any failed transactions (red/orange), results.edn gives the total :ok adds missing from final reads, and jepsen.log gives the general feel of the run.
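
To tally a batch of runs without clicking through each summary page, something like the following could be used. It is a rough sketch under assumptions: it assumes the runs are written under a Jepsen-style `store/` directory with one `results.edn` per run, and it treats a run as invalid if `:valid? false` appears anywhere in that file rather than parsing the EDN. The names `run_is_invalid` and `summarize` are made up for the example.

```python
import os
import re

# Matches ":valid? false" anywhere in the text of a results.edn file.
VALID_FALSE = re.compile(r":valid\?\s+false\b")


def run_is_invalid(results_text):
    """True if the results text contains a ':valid? false' entry."""
    return bool(VALID_FALSE.search(results_text))


def summarize(store_dir):
    """Return (valid_runs, invalid_runs) as sorted lists of run directories."""
    valid, invalid = [], []
    # os.walk does not descend into symlinked directories by default,
    # so symlinks to existing runs are not counted twice.
    for dirpath, _dirnames, filenames in os.walk(store_dir):
        if "results.edn" in filenames:
            path = os.path.join(dirpath, "results.edn")
            with open(path, encoding="utf-8") as f:
                bucket = invalid if run_is_invalid(f.read()) else valid
            bucket.append(dirpath)
    return sorted(valid), sorted(invalid)
```

Calling `summarize("store")` from the directory the tests were run in would then give the two lists; the `store` path is an assumption about the environment.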

Test failures do seem to group into several patterns:

several sequential adds not fully replicating
adds replicating to a node and then being lost on that node
ZeroMQ getting disrupted, with no further replication for the remainder of the test
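
The second pattern, an add that replicates to a node and is later lost on that same node, can be picked out of a history mechanically. A minimal sketch, assuming a hypothetical history of dicts with `type`, `f`, `value`, `node`, and a `final` flag on final reads, listed in the order the operations completed (the function name and sample values are invented for illustration):

```python
def seen_then_lost(history):
    """Return {node: sorted elements read earlier on the node but absent from its final read}."""
    seen = {}  # node -> elements observed in any earlier (non-final) :ok read
    lost = {}
    for op in history:
        if op["f"] != "read" or op["type"] != "ok":
            continue
        node, value = op["node"], set(op["value"])
        if op.get("final"):
            gone = seen.get(node, set()) - value
            if gone:
                lost[node] = sorted(gone)
        else:
            seen.setdefault(node, set()).update(value)
    return lost


# Hypothetical history: n2 observes 136 in an earlier read, then loses it.
history = [
    {"type": "ok", "f": "add", "value": 136, "node": "n1"},
    {"type": "ok", "f": "read", "value": [135, 136], "node": "n2"},
    {"type": "ok", "f": "read", "value": [135, 136], "node": "n1", "final": True},
    {"type": "ok", "f": "read", "value": [135], "node": "n2", "final": True},
]
print(seen_then_lost(history))  # {'n2': [136]}
```

Since a g-set only grows, any element reported here is an anomaly on its own, independent of whether the add was acknowledged.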

Inter DC partitioning can disrupt replication #489