coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

Etcd cluster losing quorum with 2/3 running pods #1969

Open kodieGlosser opened 6 years ago

kodieGlosser commented 6 years ago

I am currently doing some failure testing with etcd-operator:

1) I have 3 etcd pods on their own nodes running with no issues.
2) I bring down one node by stopping kubelet and docker.
3) Etcd operator will kill the pod after 5 minutes of the node being not ready (defined in its tolerations).
4) Etcd operator will then spin up a new pod on a healthy node.

I've run several tests on about 100 clusters and have hit this issue roughly 30 times. Etcd operator will try to spin up a new pod and fail repeatedly. The pod will fail after 5 attempts, and at that point we lose quorum. The error we are seeing from etcd is:

2018-06-05 21:16:35.868828 C | etcdmain: error validating peerURLs {ClusterID:424b3b1d72a035e9 Members:[&{ID:18602dd0a4a7c400 RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name:etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc ClientURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2379]}} &{ID:5482fd6ce2566e19 RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name:etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv ClientURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2379]}} &{ID:d405711199e936c6 RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name:etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx ClientURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2379]}} &{ID:dfb3691c3b5bce6e RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name: ClientURLs:[]}}] RemovedMemberIDs:[]}: member count is unequal

and the etcd-operator logs show:

time="2018-06-05T21:20:33Z" level=info msg="running members: etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:33Z" level=info msg="cluster membership: etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc,etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:33Z" level=info msg="removing one dead member" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:33Z" level=info msg="removing dead member \"etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh\"" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:38Z" level=info msg="Finish reconciling" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:38Z" level=error msg="failed to reconcile: remove member (etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh) failed: context deadline exceeded" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="Start reconciling" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="running members: etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="cluster membership: etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx,etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc,etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="Finish reconciling" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster

Something to note: when we are at 2/3, I am able to take a backup of the current state and then restore it, bringing the cluster back into a healthy state. The only issue is that we have downtime while quorum is lost.
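
For reference, the backup taken at 2/3 is just a point-in-time snapshot streamed from one of the still-healthy members. A minimal sketch of that via the clientv3 Maintenance API, assuming a reachable client (endpoint and timeout values are placeholders; the restore itself is done offline with the usual etcd restore tooling):

package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// takeSnapshot streams the backend database from one healthy member and
// writes it to a local file that a restore can later be seeded from.
func takeSnapshot(cli *clientv3.Client, path string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	rc, err := cli.Snapshot(ctx) // Maintenance API: streams a consistent snapshot
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, rc)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-example-client:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if err := takeSnapshot(cli, "snapshot.db"); err != nil {
		log.Fatal(err)
	}
}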

I saw this issue, which may be related to what I am seeing: https://github.com/coreos/etcd-operator/issues/1856

alaypatel07 commented 5 years ago

@kodieGlosser I think the following accurately sums up what's happening with the cluster:

If a 3-member cluster has 1 downed member, it can still make forward progress because the quorum is 2 and 2 members are still live. However, adding a new member to a 3-member cluster increases the quorum to 3, because 3 votes are required for a majority in a 4-member cluster. Since the quorum increased, the extra member buys nothing in terms of fault tolerance; the cluster is still one node failure away from being unrecoverable. A 3-node majority is never achieved, and eventually the operator will render the cluster useless.
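
To make the arithmetic concrete, quorum in Raft is floor(n/2) + 1, so going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure:

package main

import "fmt"

// quorum is the number of votes needed for a Raft majority: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, quorum(n), n-quorum(n))
	}
	// Output:
	// members=3 quorum=2 tolerated failures=1
	// members=4 quorum=3 tolerated failures=1
	// members=5 quorum=3 tolerated failures=2
}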

If you look at the logs, the etcd operator was not able to remove the dead member, and in the third line from the end of the operator log you can see 4 members in the cluster. The thing I am not able to figure out is why the operator is unable to remove the dead member when it still has the quorum of 2 required for the 3-member cluster.
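
For context, the removal the operator reports as "context deadline exceeded" is an ordinary MemberRemove call against the cluster. A minimal sketch of that call via clientv3 (the 5-second timeout, endpoint, and member ID are assumptions here, not the operator's actual values): a membership change has to be committed through Raft, so if the cluster cannot commit it before the timeout, the call returns "context deadline exceeded".

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// removeMember asks the cluster to drop the member with the given ID.
// The membership change must be committed through Raft; if that does not
// happen before the context expires, the call fails with
// "context deadline exceeded", as in the operator log above.
func removeMember(cli *clientv3.Client, id uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := cli.MemberRemove(ctx, id)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-example-client:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	var deadMemberID uint64 = 0xdeadbeef // placeholder: take the real ID from MemberList
	if err := removeMember(cli, deadMemberID); err != nil {
		log.Println("member removal failed:", err)
	}
}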