> What evidence do you have that they were simultaneously acting as master?

In the Big Desk plugin, the little star next to the node name kept bouncing back and forth between my two nodes (see screenshot).

> How do you know that they were in communication with each other?

I don't think I explicitly tested whether one could contact the other, but I was able to ssh into both, they were on the same network, and there did not appear to be any network issues.

> What version of Elasticsearch?

1.7.3
> In the Big Desk plugin, the little star next to the node name kept bouncing back and forth between my two nodes (see screenshot).

@speedplane I'm not familiar with the Big Desk plugin, sorry. Let's just assume that it's correct and shows what you say. Have you checked the logs or any other monitoring for repeated long-running garbage collection pauses on both of these nodes?
> I don't think I explicitly tested whether one could contact the other, but I was able to ssh into both, they were on the same network, and there did not appear to be any network issues.

Networks are fickle things, but I do suspect something else here.
> 1.7.3

Thanks.
@speedplane "2 node situation" is inherently hard to deal with because there is no one metric iy could be decided which one should be shot down.
"Most written to" or "last written to" doesnt really mean much and in most cases alerting that something is wrong is preferable to "just throw away whatever other node had".
That is why a lot of distributed software recommends at least 3 nodes, because with 3 there is always majority, so you can set it up to only allow requests if at least n/2+1
nodes are up
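To make the quorum arithmetic concrete, here is a minimal sketch in plain Python (the cluster sizes are just examples); in Elasticsearch this threshold is what the `discovery.zen.minimum_master_nodes` setting expresses:

```python
# Quorum rule: a partition may keep operating only if it can see a
# strict majority of the master-eligible nodes, i.e. floor(n/2) + 1.
def quorum(n_master_eligible: int) -> int:
    return n_master_eligible // 2 + 1

for n in (2, 3, 5, 7):
    print(f"{n} master-eligible nodes -> quorum of {quorum(n)}")

# With n = 2 the quorum is 2, so losing either node halts the cluster;
# no split leaves a majority anywhere, which is why 3+ nodes are advised.
```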
G'day,
I'm using ElasticSearch 0.19.11 with the unicast Zen discovery protocol.
With this setup, I can easily split a 3-node cluster into two 'hemispheres' (continuing with the brain metaphor) with one node acting as a participant in both hemispheres. I believe this to be a significant problem, because now `minimum_master_nodes` is incapable of preventing certain split-brain scenarios.

Here's what my 3-node test cluster looked like before I broke it:
Here's what the cluster looked like after simulating a communications failure between nodes (2) and (3):
Here's what seems to have happened immediately after the split: nodes (2) and (3) each dropped the other from their view of the cluster (`zen-disco-node_failed ... reason failed to ping`).

At this point, I can't say I know what to expect to find on node (1). If I query both masters for a list of nodes, I see node (1) in both clusters.
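To compare the two masters' views, here is a minimal sketch (the host names are hypothetical; it assumes the `_cluster/state` API, which returns the responding master's view of cluster membership):

```python
# Ask each would-be master for its view of cluster membership.
# Host names are hypothetical; adjust to the real nodes' addresses.
import json
import urllib.request

for master in ("http://node2:9200", "http://node3:9200"):
    with urllib.request.urlopen(master + "/_cluster/state") as resp:
        state = json.load(resp)
    # state["nodes"] maps node id -> node info, one entry per member.
    members = sorted(info["name"] for info in state["nodes"].values())
    print(master, "sees:", members)
```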
Let's look at `minimum_master_nodes` as it applies to this test cluster. Assume I had set `minimum_master_nodes` to 2. Had node (3) been completely isolated from nodes (1) and (2), I would not have run into this problem. The left hemisphere would have enough nodes to satisfy the constraint; the right hemisphere would not. This would continue to work for larger clusters (with an appropriately larger value for `minimum_master_nodes`).

The problem with `minimum_master_nodes` is that it does not work when the split brains are intersecting, as in my example above. Even on a larger cluster of, say, 7 nodes with `minimum_master_nodes` set to 4, all that needs to happen is for the 'right' two nodes to lose contact with one another (a master election has to take place) for the cluster to split.

Is there anything that can be done to detect the intersecting split on node (1)?
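To spell out why the constraint passes on both sides, here is a small sketch of the intersecting split in the 3-node example (node names are hypothetical; it assumes each master simply counts the master-eligible nodes it can currently reach):

```python
# Intersecting split from the 3-node example: node (1) is reachable from
# both hemispheres, so both would-be masters count it toward their quorum.
minimum_master_nodes = 2

hemispheres = {
    "node2": {"node1", "node2"},  # left: node (2) remains master
    "node3": {"node1", "node3"},  # right: node (3) elects itself master
}

for master, visible in hemispheres.items():
    # Each master checks the constraint against the nodes *it* can see.
    ok = len(visible) >= minimum_master_nodes
    print(f"{master} sees {len(visible)} nodes -> constraint "
          f"{'satisfied' if ok else 'violated'}")

# Both checks pass, so both masters keep serving: a split brain that
# minimum_master_nodes cannot catch, because the splits overlap at node (1).
```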
Would #1057 help?
Am I missing something obvious? :)