elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Jepsen transient failures under network partition conditions #7549

Closed pilvitaneli closed 8 years ago

pilvitaneli commented 10 years ago

Hi! The Jepsen tests include five nemeses (test scenarios) that introduce different types of network partitions. The tests add documents to an index before, during and after these partitions, and verify that the documents acknowledged during the partitions are retrievable afterwards. Sometimes the tests indicate that a number of documents were indexed but are not retrievable; however, this does not happen on every run of the same scenario. For example, in 20 runs of each scenario (against 598854dd72d7fb01a7e26a9dad065de3deaa5eb7), the following :lost-frac amounts were reported:

isolate-self-primaries-nemesis: 244/361, 2/733, 1/607, 1/603, 1/213, 65/216 (and 14 times 0)
nemesis/partition-random-halves: 1/355, 1/226, 4/733, 1/433 (and 16 times 0)
nemesis/partition-halves: 1/65, 1/438, 4/715, 2/457, 6/731, 1/435, 9/433 (and 13 times 0)
nemesis/partitioner nemesis/bridge: 2/415, 3/253, 2/383, 7/754, 1/786, 1/767 (and 14 times 0)
nemesis/partition-random-node: does not report any lost documents

In total, 23 out of 100 runs failed.
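As a rough illustration of the check described above (not Jepsen's actual checker, which is written in Clojure), a run can be summarised by comparing the set of acknowledged document ids with the set found by a final read. The function name `count_lost` and the choice of acknowledged writes as the denominator are assumptions made for this sketch; the thread does not state which denominator :lost-frac uses.

```python
def count_lost(acknowledged, retrieved):
    """Summarise one test run (hypothetical helper, not part of Jepsen).

    acknowledged: ids of documents whose index request was acknowledged.
    retrieved: ids of documents found by a read after the partition heals.
    Returns (lost_ids, total), where lost_ids is the set of acknowledged
    documents that could not be read back and total is the number of
    acknowledged documents (assumed denominator).
    """
    acked = set(acknowledged)
    lost = acked - set(retrieved)
    return lost, len(acked)


def run_failed(acknowledged, retrieved):
    """A run counts as failed if any acknowledged document is missing."""
    lost, _ = count_lost(acknowledged, retrieved)
    return bool(lost)


if __name__ == "__main__":
    acked = [1, 2, 3, 4]
    found = [1, 2, 4, 5]  # 5 was never acknowledged, so it is ignored
    lost, total = count_lost(acked, found)
    print(f"{len(lost)}/{total}", run_failed(acked, found))
```

Documents that are readable but were never acknowledged do not count against the run in this sketch; only acknowledged writes that disappear are treated as lost.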

dakrone commented 10 years ago

Hi @pilvitaneli, thanks for the testing results!

We're actively running the Jepsen tests alongside our own tests, which resulted in #7572. The Jepsen tests helped verify that we fixed the split-brain issue (it no longer happens). In all of our runs, though, we couldn't reproduce a result similar to your first run (the isolate-self-primaries-nemesis run where you lost 244/361). We're still trying, but I might circle back with you to figure out how you ended up with those results. We do manage to reproduce the smaller-scale data loss that we believe relates to #7572, but this is also still under investigation.

I'll let you know how our continued testing with Jepsen goes. Thanks again for your results!

pilvitaneli commented 10 years ago

Running just isolate-self-primaries-nemesis 50 times in succession results in 22 failures: 1/403, 404/653, 1/583, 6/667, 287/395, 4/583, 16/655, 3/1037, 8/807, 1/565, 1/555, 5/638, 1/626, 3/784, 3/653, 2/621, 3/632, 1/254, 1/610, 3/307, 11/668, 1/446

dakrone commented 10 years ago

@pilvitaneli, circling back to this after a while: do you happen to have the commit SHA of Jepsen that you are using to run your tests? I'd like to make sure we run the same tests.

pilvitaneli commented 10 years ago

I haven't run the tests in a while, but the last run was with https://github.com/aphyr/jepsen/commit/761693bd9b2a71528cb254e357ea1a6e8878129d . There do not appear to be any significant changes since then, but I could try to re-run with current master.

dakrone commented 8 years ago

Going to close this, as it's been almost 2 years and we are tracking this work for the 5.0 release in a different issue: #20031