Closed pilvitaneli closed 8 years ago
Hi @pilvitaneli, thanks for the testing results!
We're actively investigating Jepsen tests on top of our own tests, which resulted in #7572. The Jepsen tests helped verify that we fixed the split brain issue (it no longer happens). In all of our runs, though, we couldn't reproduce a result similar to your first run (the isolate-self-primaries-nemesis run where you lost 244/361). We're still trying, but I might circle back with you to figure out how you ended up with those results. We do manage to reproduce the smaller-scale data loss that we believe relates to #7572, but this is also still under investigation.
I'll let you know how our continued testing with Jepsen goes, thanks again for your results!
Running just isolate-self-primaries-nemesis 50 times in succession results in 22 failures: 1/403, 404/653, 1/583, 6/667, 287/395, 4/583, 16/655, 3/1037, 8/807, 1/565, 1/555, 5/638, 1/626, 3/784, 3/653, 2/621, 3/632, 1/254, 1/610, 3/307, 11/668, 1/446
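For anyone who wants to compare their own runs against these numbers, here is a minimal Python sketch that summarizes the list above. The helper names (`parse_lost_fracs`, `summarize`) are made up for illustration and are not part of Jepsen; the only inputs are the lost/total pairs quoted in this comment and the stated 50 runs.

```python
from fractions import Fraction

# Lost/total pairs copied from the comment above (the failing runs only).
REPORTED = ("1/403 404/653 1/583 6/667 287/395 4/583 16/655 3/1037 8/807 "
            "1/565 1/555 5/638 1/626 3/784 3/653 2/621 3/632 1/254 1/610 "
            "3/307 11/668 1/446")
TOTAL_RUNS = 50  # total runs stated in the comment


def parse_lost_fracs(text):
    """Turn whitespace-separated 'lost/total' tokens into (lost, total) pairs."""
    pairs = []
    for token in text.split():
        lost, total = token.split("/")
        pairs.append((int(lost), int(total)))
    return pairs


def summarize(pairs, total_runs):
    """Count failing runs and pick out the run with the largest lost fraction."""
    worst = max(pairs, key=lambda p: Fraction(p[0], p[1]))
    return {
        "failures": len(pairs),
        "failure_rate": Fraction(len(pairs), total_runs),
        "worst": worst,
        "single_doc_losses": sum(1 for lost, _ in pairs if lost == 1),
    }


summary = summarize(parse_lost_fracs(REPORTED), TOTAL_RUNS)
print(summary)
```

Exact fractions are used instead of floats so that runs can be ranked by lost fraction without rounding surprises.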
@pilvitaneli circling back to this after a while: do you happen to have the commit SHA of Jepsen that you are using for running your tests? I'd like to make sure we run the same tests.
I haven't run the tests in a while, but the last run was with https://github.com/aphyr/jepsen/commit/761693bd9b2a71528cb254e357ea1a6e8878129d . There do not appear to be considerable changes after that, but I could try to re-run with current master.
Going to close this, as it's been almost 2 years and we are tracking this in a different issue for the 5.0 release: #20031
Hi! Jepsen tests include five nemeses (test scenarios) that introduce different types of network partitions (see here). The tests add documents to an index before, during, and after these partitions, and verify that the documents that were acknowledged during the partitions are retrievable afterwards. Sometimes the tests indicate that a number of documents were indexed but are not retrievable; however, this does not happen on every run of the same scenario. For example, in 20 runs of each scenario (against 598854dd72d7fb01a7e26a9dad065de3deaa5eb7), the following :lost-frac amounts were reported:
isolate-self-primaries-nemesis: 244/361, 2/733, 1/607, 1/603, 1/213, 65/216 (and 14 times 0)
nemesis/partition-random-halves: 1/355, 1/226, 4/733, 1/433 (and 16 times 0)
nemesis/partition-halves: 1/65, 1/438, 4/715, 2/457, 6/731, 1/435, 9/433 (and 13 times 0)
nemesis/partitioner nemesis/bridge: 2/415, 3/253, 2/383, 7/754, 1/786, 1/767 (and 14 times 0)
nemesis/partition-random-node does not report any lost documents.
In total, out of 100 runs, 23 failed.
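As a cross-check of that total, here is a small Python sketch that tallies the failing runs per nemesis from the :lost-frac values listed above. The `REPORTED` table and the `tally` helper are made up for illustration and are not part of Jepsen. It assumes, as the "(and N times 0)" notes indicate, that every run not listed lost nothing.

```python
# :lost-frac values copied from the report above; each nemesis ran 20 times.
RUNS_PER_NEMESIS = 20
REPORTED = {
    "isolate-self-primaries-nemesis": "244/361 2/733 1/607 1/603 1/213 65/216",
    "nemesis/partition-random-halves": "1/355 1/226 4/733 1/433",
    "nemesis/partition-halves": "1/65 1/438 4/715 2/457 6/731 1/435 9/433",
    "nemesis/partitioner nemesis/bridge": "2/415 3/253 2/383 7/754 1/786 1/767",
    "nemesis/partition-random-node": "",  # no lost documents reported
}


def tally(reported, runs_per_nemesis):
    """Map each nemesis to (failed_runs, clean_runs)."""
    rows = {}
    for name, text in reported.items():
        failed = len(text.split())  # one token per run that lost documents
        rows[name] = (failed, runs_per_nemesis - failed)
    return rows


rows = tally(REPORTED, RUNS_PER_NEMESIS)
total_failed = sum(failed for failed, _ in rows.values())
total_runs = RUNS_PER_NEMESIS * len(REPORTED)

for name, (failed, clean) in rows.items():
    print(f"{name}: {failed} failed, {clean} clean")
print(f"total: {total_failed}/{total_runs} runs lost documents")
```

The per-nemesis clean counts it derives (14, 16, 13, 14, 20) match the "times 0" figures in the report.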