kodieGlosser opened this issue 6 years ago (status: Open)
@kodieGlosser I think the following sums up accurately what's happening with the cluster:
If a 3-member cluster has 1 downed member, it can still make forward progress because the quorum is 2 and 2 members are still live. However, adding a new member to a 3-member cluster raises the quorum to 3, because 3 votes are required for a majority in a 4-member cluster. Since the quorum increased, the extra member buys nothing in terms of fault tolerance: the cluster is still one node failure away from being unrecoverable. A 3-vote majority is never achieved, and eventually the operator will render the cluster useless.
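To make the arithmetic concrete, here is a minimal Go sketch (not code from etcd or the operator; the `quorum` helper is just for illustration) that computes the majority threshold for a few cluster sizes:

```go
package main

import "fmt"

// quorum returns the number of votes required for a majority
// in a cluster of the given size: floor(n/2) + 1.
func quorum(members int) int {
	return members/2 + 1
}

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("%d members -> quorum %d, tolerates %d failure(s)\n",
			n, quorum(n), n-quorum(n))
	}
}
```

With 3 members the quorum is 2 and one failure is tolerated; with 4 members the quorum is 3 and still only one failure is tolerated, which is why adding the fourth member before removing the dead one gains nothing.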
Looking at the logs, the etcd operator was not able to remove the dead member, and in the third line from the end you can see 4 members in the cluster. What I am not able to figure out is why the operator cannot remove the dead member when it still has the quorum of 2 required for the 3-member cluster.
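For reference, removing the dead member by hand before adding a replacement would look roughly like the following Go sketch using the etcd clientv3 API. The endpoints and the member ID are placeholders to be taken from the actual cluster; this is only a sketch of the sequence (remove first, then add), not the operator's own code:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // etcd v3.5+ client import path
)

func main() {
	// Point the client at the surviving members (placeholder endpoints).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-0:2379", "http://etcd-1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// List current members; with 2 of 3 members alive this should still
	// succeed because the quorum of 2 is intact.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		fmt.Printf("member %x name=%q peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
	}

	// Remove the dead member by ID before adding a replacement, so the
	// cluster size (and therefore the quorum) never grows to 4.
	const deadMemberID = 0x1234abcd // placeholder ID taken from the list above
	if _, err := cli.MemberRemove(ctx, deadMemberID); err != nil {
		log.Fatal(err)
	}
}
```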
I am currently doing some failure testing with etcd-operator:

1. I have 3 etcd pods, each on its own node, running with no issues.
2. I bring down one node by stopping kubelet and docker.
3. The etcd operator kills the pod after the node has been NotReady for 5 minutes (defined in its tolerations).
4. The etcd operator then spins up a new pod on a healthy node.
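As a side note, a rough client-go sketch like the one below can be used to watch the member pods during steps 3 and 4 and see whether the replacement pod lands on a healthy node. The namespace and the `etcd_cluster=example-etcd` label selector are assumptions about how the cluster pods are labelled and should be adjusted:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List the etcd member pods and print their phase and node placement.
	pods, err := clientset.CoreV1().Pods("default").List(context.Background(),
		metav1.ListOptions{LabelSelector: "etcd_cluster=example-etcd"})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s\t%s\t%s\n", p.Name, p.Status.Phase, p.Spec.NodeName)
	}
}
```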
I've run several tests on about 100 clusters and have hit this issue around ~30 times. The etcd operator tries to spin up a new pod and fails repeatedly. The pod gives up after 5 attempts, and at that point we lose quorum. The error we are seeing from etcd is:
and the etcd-operator logs show:
Something to note: when we are at 2/3 members, I am able to take a backup of the current state and then restore it, bringing the cluster back to a healthy state. The only issue is that we have downtime while quorum is lost.
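For context, taking that backup over the client API looks roughly like the Go sketch below (the endpoint and output path are placeholders); the restore itself is then done out of band, e.g. with `etcdctl snapshot restore`:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // etcd v3.5+ client import path
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-0:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Stream a snapshot of the current state while quorum (2/3) still holds.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	f, err := os.Create("backup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, rc); err != nil {
		log.Fatal(err)
	}
	log.Println("snapshot written to backup.db")
}
```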
I saw this issue; maybe it is related to what I am seeing: https://github.com/coreos/etcd-operator/issues/1856