elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

minimum_master_nodes does not prevent split-brain if splits are intersecting #2488

Closed saj closed 10 years ago

saj commented 11 years ago

G'day,

I'm using ElasticSearch 0.19.11 with the unicast Zen discovery protocol.

With this setup, I can easily split a 3-node cluster into two 'hemispheres' (continuing with the brain metaphor) with one node acting as a participant in both hemispheres. I believe this to be a significant problem, because now minimum_master_nodes is incapable of preventing certain split-brain scenarios.

Here's what my 3-node test cluster looked like before I broke it:

Here's what the cluster looked like after simulating a communications failure between nodes (2) and (3):

Here's what seems to have happened immediately after the split:

  1. Node (2) and (3) lose contact with one another. (zen-disco-node_failed ... reason failed to ping)
  2. Node (2), still master of the left hemisphere, notes the disappearance of node (3) and broadcasts an advisory message to all of its followers. Node (1) takes note of the advisory.
  3. Node (3) has now lost contact with its old master and decides to hold an election. It declares itself winner of the election. On declaring itself, it assumes master role of the right hemisphere, then broadcasts an advisory message to all of its followers. Node (1) takes note of this advisory, too.

At this point, I can't say I know what to expect to find on node (1). If I query both masters for a list of nodes, I see node (1) in both clusters.

Let's look at minimum_master_nodes as it applies to this test cluster. Assume I had set minimum_master_nodes to 2. Had node (3) been completely isolated from nodes (1) and (2), I would not have run into this problem. The left hemisphere would have enough nodes to satisfy the constraint; the right hemisphere would not. This would continue to work for larger clusters (with an appropriately larger value for minimum_master_nodes).

The problem with minimum_master_nodes is that it does not work when the split brains are intersecting, as in my example above. Even on a larger cluster of, say, 7 nodes with minimum_master_nodes set to 4, all that needs to happen is for the 'right' two nodes to lose contact with one another (a master election has to take place) for the cluster to split.
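
A minimal sketch of the counting problem (not Elasticsearch's actual election code; the node numbering and visibility sets come from the 3-node example above, and the quorum check is a deliberate simplification of what the minimum_master_nodes check amounts to):

    # Sketch: why a naive "count the master-eligible nodes I can reach" check
    # passes on both sides of an intersecting partition.
    MINIMUM_MASTER_NODES = 2

    # Who each node can currently reach (including itself), per the example above.
    # Node (1) is the bridge node that still sees both hemispheres.
    visibility = {
        1: {1, 2, 3},
        2: {1, 2},   # lost contact with node 3
        3: {1, 3},   # lost contact with node 2
    }

    def quorum_satisfied(node):
        """Naive check: can this node see at least minimum_master_nodes nodes?"""
        return len(visibility[node]) >= MINIMUM_MASTER_NODES

    for node in sorted(visibility):
        print(node, quorum_satisfied(node))
    # Nodes 2 and 3 each see two nodes (themselves plus bridge node 1), so both
    # hemispheres satisfy minimum_master_nodes=2 and can keep or elect a master.
    # In a clean split, the isolated node would see only itself and be blocked.

The same arithmetic extends to the 7-node, minimum_master_nodes=4 example: when only the link between the two 'right' nodes is cut, each side still counts far more than four reachable master-eligible nodes.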

Is there anything that can be done to detect the intersecting split on node (1)?

Would #1057 help?

Am I missing something obvious? :)

aphyr commented 10 years ago

Because ElasticSearch allocates IDs sequentially instead of using k-ordered flake IDs,

Scratch that, I think I was wrong. I misread the Wireshark traces; it looks like Elasticsearch is actually generating UUIDs and then discarding all the data on one side of the partition anyway. I assumed no sane database would do that and that it had to be due to a record conflict. I think it's either generating the same UUIDs on both sides of the split, or it's throwing away data without checking whether it's cleanly mergeable.

kimchy commented 10 years ago

@aphyr the way ES works is that when minimum master nodes is breached, it will not allow writes until the situation is resolved. The bug this issue points to is the problem we have with handling minimum master nodes in the mentioned case. It took us some time to build the infrastructure to be able to reproduce it in our test environment, and, as mentioned, we are working on fixing this scenario to properly handle the case.

shikhar commented 10 years ago

fwiw, another discovery plugin built with avoiding split-brain in mind: https://github.com/shikhar/eskka

Verifying this using Jepsen is on the agenda.

AeroNotix commented 10 years ago

@shikhar how's it going with the Jepsen testing? I could help you with that.

shikhar commented 10 years ago

@AeroNotix that'd be awesome! let's take this to https://github.com/shikhar/eskka/issues/6

brusic commented 10 years ago

Kevin Kluge at Elasticsearch informed me that they are perhaps not looking into Paxos/Raft consensus, but are instead working on improving zen discovery. There is a branch on GitHub with the improvements:

https://github.com/elasticsearch/elasticsearch/tree/feature/improve_zen

masumsoft commented 10 years ago

Is it possible to get rid of this situation by somehow dynamically updating the minimum_master_nodes configuration based on current state of the cluster?
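
(For reference, minimum_master_nodes can, in these versions, be updated at runtime through the cluster update settings API. Below is a minimal sketch of such an update using Python's requests library; the localhost address and the value 2 are placeholder assumptions. As the replies below note, though, loosening the quorum to match whatever nodes are currently visible would defeat the purpose of the setting.)

    # Sketch: changing minimum_master_nodes at runtime via the cluster update
    # settings API (host/port and the new value are placeholders).
    import requests

    resp = requests.put(
        "http://localhost:9200/_cluster/settings",
        json={"persistent": {"discovery.zen.minimum_master_nodes": 2}},
    )
    print(resp.json())

    # Caveat: if each side of a split independently lowered this value to match
    # the nodes it can still see, both sides could satisfy their own (reduced)
    # quorum -- which is exactly the split-brain the setting exists to prevent.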

AeroNotix commented 10 years ago

@brusic It seems very headstrong to be rolling your own distributed consensus algorithm.

AeroNotix commented 10 years ago

@masumsoft Explain further; what you're suggesting sounds like reducing the number of members required to form a cluster, and I don't see how that fixes a split-brain.

ejsarge-gr commented 10 years ago

@brusic I'm very glad Elasticsearch is looking into this issue, but I would agree with @AeroNotix. If the leader election algorithm isn't based on a peer-reviewed, published algorithm, why should we trust that it is reliable in every case? The insight from @aphyr's Jepsen work is that distributed concurrency defects are incredibly hard to find via testing.

otisg commented 10 years ago

Not to get too philosophical here, but one could have said the same thing some 5 years ago when @kimchy started working on ES - why do it - there was already Solr and it was working perfectly fine for pretty much everyone on the planet. +1 for innovation.

XANi commented 10 years ago

@otisg if it is actual innovation, sure, but if there is already a working and tested solution to a problem, and the "invented" solution works so poorly that users lose their data because of it... just use the tried and tested one, at least until your "invention" is proven and working

shikhar commented 10 years ago

@aphyr

I can confirm that partitions with nodes that can see both sides of the cluster reliably induce ElasticSearch split brain after about a hundred seconds

I can confirm that even when using

   :nemesis   (nemesis/partition-halves)

where there is a clear majority partition -- I am able to see multiple masters (e.g. n1,n2 report n2 is master; n3,n4,n5 report n5 is master). The test fails with acknowledged writes lost.

kimchy commented 10 years ago

FYI, the improve_zen branch already contains a fix for this issue; we are letting it bake, as this is a delicate change, and we are working on adding more test scenarios (aside from the one detailed in this issue) to make sure. The plan is to aim at getting this into 1.3. We have not yet run Jepsen (which simulates the same scenario we already simulate in our tests), but we will do that as well.

AeroNotix commented 10 years ago

@kimchy any comments on the underlying consensus algorithm?

nilsga commented 10 years ago

Did the improve_zen branch make it to 1.3?

kimchy commented 10 years ago

@nilsga no, we are making great progress on it, and the plan now is to try and get it into 1.4...

AeroNotix commented 10 years ago

@kimchy any comments on the underlying consensus algorithm?

kimchy commented 10 years ago

@AeroNotix the main change there is that we fixed the bug in our gossip protocol to elect a master while maintaining the minimum master nodes criteria. This will ensure (validated through our testing) that the mentioned split brain will not happen. This is the most urgent fix we are aiming to provide with improve_zen (also, thanks to the new simulated testing infra, we have uncovered several other bugs already, some fixed in 1.x/master, others only in improve_zen).

AeroNotix commented 10 years ago

Does the code in improve_zen branch still suffer from split-brains?

kimchy commented 10 years ago

@AeroNotix the fix is in, and no, not based on our current testing; but we are still writing and performing more tests, some resulting in more improvements to it (it is easy to see the commits being made on it): https://github.com/elasticsearch/elasticsearch/commits/feature/improve_zen

AeroNotix commented 10 years ago

Forgive me here, but pointing me at a branch and having me figure out your consensus algorithm makes it extremely difficult for me to learn whether I can base my usage of ElasticSearch on good grounds.

Do you have a formalized specification of your consensus algorithm so that I may learn more about it?

kimchy commented 10 years ago

@AeroNotix it wasn't clear to me what you were after; proper documentation of the protocol we use for leader election is also on our list of TODOs (zen discovery deals with what we call the cluster-level state and leader election).

AeroNotix commented 10 years ago

Essentially I am after a non-code implementation of your protocol so that I may compare it to something which is formally verified.

mihasya commented 10 years ago

Really happy to see this ticket with code and recent updates as the 3rd search result for "elasticsearch jepsen" after reading @aphyr's post. It is REALLY REALLY REALLY good that you guys are taking this seriously and actually addressing it. It makes me feel good about the choice to use ES, even if I'm quite a bit more scared about it now than I was 3 hours ago :grin:. I have a couple of questions:

  1. Now that these issues have been identified, is there any place that they are documented? Has any documentation been put together which correctly describes ES in CAP terms? e.g., not counting error responses as _A_vailable, indicating the scenarios in which _C_onsistency is compromised, etc? I would like to encourage you to be very honest and up-front about this stuff. It will not make the project look bad; on the contrary, it will make it look good. People that don't care and just want to use it as a loose search engine will continue to pile data into it and likely won't even read that doc. People that DO care (like me) will appreciate the honesty. I may choose not to run ElasticSearch based on those characteristics, but IMO that's better for everyone than me running ES, losing data, then using something else anyway after burning a bunch of time and posting a bunch of angry bug reports.
  2. What are the perceived advantages of ZenDisco over other more mature, well-defined, and well-tested leader election protocols? As an early user, abuser, and extender of Cassandra (0.6 and on), I can tell you that getting cluster liveness right is simultaneously difficult and critical. Getting it wrong will cause your users much grief (and it looks like it already has). Are alternatives being explored? It looks like there's a ZooKeeper plugin available, but I've also seen comments indicating it's not supported. Is that true? Is there any hope that it'll one day fly again? I see this fork from an ES core team member with fairly recent commits that claims compatibility with 1.1 - is anyone using this successfully? Is it tested regularly along with ZenDisco? Is the plan to continue supporting it for future versions?
  3. How can I/we help in this particular area? No promises, as we have a tiny tiny engineering team (one of the appeals of ES was precisely the ease of going from 0 to having the ability to throw JSON into it and peel out), but perhaps we can help test some alternative implementations? Maybe said ZK plugin?

pilvitaneli commented 10 years ago

@kimchy I've been running Jepsen tests on the improve_zen branch quite regularly, and four nemeses (isolate-self-primaries-nemesis, nemesis/partition-random-halves, nemesis/partition-halves, nemesis/partitioner nemesis/bridge) seem to fail transiently, i.e. 2-4 runs out of ten seem to result in some lost documents. One nemesis (nemesis/partition-random-node) hasn't failed in a few thousand successive runs, so I would consider that scenario to be fixed.

kimchy commented 10 years ago

@pilvitaneli thanks for the effort! We have been running it as well on our end. In short, we verified that the split brain doesn't seem to happen anymore, but still need some work around the replication logic in addition to it to strengthen certain failure cases. We identified the case, but our hope is to push improve_zen with the fix to split brain as soon as possible, and then progress on the replication aspect.

We are in the final stages of testing on improve_zen, now mainly doing verifications with hundreds of nodes, since improve_zen now does a round of gossip on master failure and we had to work on optimizing the resource usage in that case.

I hope that in the next few days we will publish a new resiliency status page (writing it now; it took some time, summer... :) ). The resiliency status page should shed light on, and be a good place to aggregate, all the effort going into any aspect of resiliency in ES (there is much work being done beyond this issue): the work that has already been done, the work that is in progress, and the things that we know are still left to be done.

bleskes commented 10 years ago

I'm closing this issue, as it is solved, as specified, by the changes made in #7493. Of course, there is more work to be done and the effort continues.

Thx for all the input and discussion.

AeroNotix commented 10 years ago

@bleskes so it's 100% fixed?

bleskes commented 10 years ago

this issue (partial network splits causing split brain) is fixed now, yes.

AeroNotix commented 10 years ago

Interesting, will have to confirm that myself with Jepsen tests.

shikhar commented 10 years ago

still need some work around the replication logic in addition to it to strengthen certain failure cases

is there an issue(s) open for this?

bleskes commented 10 years ago

@AeroNotix sure, let me know what you run into. Do note, though, that Jepsen tests more than what is stated in this issue, for example the document replication model.

@shikhar there is more than one thing to do. I think the best way to follow the work is through the resiliency label.

mschirrmeister commented 10 years ago

Is there an ETA for when 1.4 will be released, or will this even go into the next 1.3.x update?

kimchy commented 10 years ago

@shikhar things take a bit longer than expected, but expect issue(s) for the rest of the known work to be opened in the next few days, as well as the status page I talked about (just came back from vacation personally :) ).

kimchy commented 10 years ago

@mschirrmeister this feature is not planned to be backported to 1.3; it's too big a change. No concrete ETA for 1.4; hopefully we will have a release (possibly first in beta form) in the next couple of weeks.

bleskes commented 10 years ago

@shikhar FYI - I opened a ticket for the issue we discussed above: https://github.com/elasticsearch/elasticsearch/issues/7572

shikhar commented 10 years ago

thanks @bleskes!

kelaban commented 9 years ago

@kimchy, has the behavior of minimum master nodes changed in the new implementation, as stated here? As I currently understand the setting, if there are fewer than N nodes available, no cluster will exist (no reads or writes).

kimchy commented 9 years ago

@kelaban the "previous" behavior is the same as the current one, when N nodes are not available, then that side of the cluster becomes blocked. In the new implementation (1.4), there is an option to decide if reads will still be allowed on that cluster or not.

ghost commented 9 years ago

@kimchy Where can we find this option? Should it already be present in the 1.4 branch? We came across the split-brain issue and we want to know whether this is fixed in 1.4.

clintongormley commented 9 years ago

@fevers this is what you're looking for: http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.4/modules-discovery-zen.html#no-master-block
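
A rough sketch of what that option changes from a client's point of view (sketch only; the local address, index name, type, and document ID are placeholder assumptions, and the exact status codes may vary):

    # Sketch: probing a node that is under the "no master" block.
    import requests

    BASE = "http://localhost:9200"

    read = requests.get(BASE + "/test-index/doc/1")
    write = requests.put(BASE + "/test-index/doc/1", json={"field": "value"})

    # With discovery.zen.no_master_block: write (the default), reads against the
    # last known cluster state typically still succeed while writes are rejected
    # with a cluster-block error. With no_master_block: all, both are rejected
    # until a master satisfying minimum_master_nodes is elected again.
    print("read:", read.status_code)
    print("write:", write.status_code)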

aphyr commented 9 years ago

this issue (partial network splits causing split brain) is fixed now, yes.

I'm not sure why this issue was closed--people keep citing it and saying the problem is solved, but the Jepsen test from earlier in this thread still fails. Partial network partitions (and, for that matter, clean network partitions, and single-node partitions, and single-node pauses) continue to result in split-brain and lost data, for both compare-and-set and document-creation tests. I don't think the changes from https://github.com/elastic/elasticsearch/pull/7493 were sufficient to solve the problem, though they may have improved the odds of successfully retaining data.

For instance, here's a test in which we induce randomized 120-second-long intersecting partitions, for 600 seconds, with 10 seconds of complete connectivity in between each failure. This pattern resulted in 22/897 acknowledged documents being lost due to concurrent, conflicting primary nodes. You can reproduce this in Jepsen 7d0a718 by going to the elasticsearch directory and running lein test :only elasticsearch.core-test/create-bridge -- it may take a couple of runs to actually trigger the race, though.

bleskes commented 9 years ago

I'm not sure why this issue was closed

This issue, as it is stated, relates to having two master nodes elected during a partial network split, despite min_master_nodes. This issue should be solved now. The thinking is that we will open issues for different scenarios as they are discovered. An example is #7572, as well as your recent tickets (#10407 & #10426). Once we figure out the root cause of those failures (and the one mentioned in your previous comment), and if it turns out to be similar to this issue, it will of course be re-opened.

speedplane commented 8 years ago

Not directly on topic for this issue, but why is it so difficult to avoid/prevent this split-brain issue? If there are two master nodes on a network (i.e., a split-brain configuration), why can't there be some protocol for the two masters to figure out which one should become a slave?

I imagine some mechanism would need to detect that the system is in a split-brain state, and then a heuristic would be applied to choose the real master (e.g., oldest running server, most number of docs, random choice, etc.). This probably takes work to do, but it does not seem too difficult.

hamiltop commented 8 years ago

Michael: Split brain occurs precisely because the two masters can't communicate. If they could, they would resolve it.

speedplane commented 8 years ago

Got it. Earlier this week two nodes in my cluster appeared to be fighting for who was the master of the cluster. They were both on the same network and I believe were in communication with each other, but they went back and forth over which was the master. I shut down one of the nodes, gave it five minutes, restarted that node, and everything was fine. I thought that this was a split brain issue, but I guess it may be something else.

jasontedor commented 8 years ago

Earlier this week two nodes in my cluster appeared to be fighting for who was the master of the cluster.

@speedplane Do you have exactly two master-eligible nodes? Do you have minimum master nodes set to two (if you're going to run with exactly two master-eligible nodes you should, although this means that your cluster becomes semi-unavailable if one of the masters faults; ideally if you have multiple master-eligible nodes you'll have at least three and have minimum master nodes set to a quorum of them)?

I thought that this was a split brain issue, but I guess it may be something else.

Split brain is when two nodes in a cluster are simultaneously acting as masters for that cluster.
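
As a side note on the sizing advice above, a small sketch of the usual quorum arithmetic (the node counts are just examples):

    # Sketch: quorum sizing for minimum_master_nodes.
    def quorum(master_eligible_nodes):
        """Majority of master-eligible nodes: floor(n / 2) + 1."""
        return master_eligible_nodes // 2 + 1

    for n in (2, 3, 5):
        q = quorum(n)
        print(n, "master-eligible nodes -> minimum_master_nodes =", q,
              "| tolerates", n - q, "failure(s)")
    # 2 nodes -> quorum 2: any single failure blocks the cluster, while setting
    #            it to 1 instead permits split-brain.
    # 3 nodes -> quorum 2: tolerates one failure without split-brain.
    # 5 nodes -> quorum 3: tolerates two failures.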

speedplane commented 8 years ago

@jasontedor Yes, I had exactly two nodes, and minimum master nodes was set to one. I did this intentionally for the exact reason you described. It appeared that the two nodes were simultaneously acting as a master, but they were both in communication with each other, so shouldn't they be able to resolve it, as @hamiltop suggests?

jasontedor commented 8 years ago

Yes, I had exactly two nodes, and minimum master nodes was set to one.

@speedplane This is bad because it does subject you to split brain.

I did this intentionally for the exact reason you described.

That's not what I recommend. Either drop to one (and lose high-availability), or increase to three (and set minimum master nodes to two).

It appeared that the two nodes were simultaneously acting as a master, but they were both in communication with each other, so shouldn't they be able to resolve it, as @hamiltop suggests?

What evidence do you have that they were simultaneously acting as master? How do you know that they were in communication with each other? What version of Elasticsearch?