coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

Etcd cluster losing quorum with 2/3 running pods #1969

Open kodieGlosser opened 6 years ago

kodieGlosser commented 6 years ago

I am currently doing some failure testing with etcd-operator:

1) I have 3 etcd pods on their own nodes running with no issues.
2) I bring down one node by stopping kubelet and docker.
3) Etcd operator will kill the pod after 5 minutes of the node being not ready (defined in its tolerations).
4) Etcd operator will then spin up a new pod on a healthy node.

I've run several tests on about 100 clusters and have hit this issue roughly 30 times. Etcd operator will try to spin up a new pod and fail repeatedly. The pod will fail after 5 attempts, and at that point we lose quorum. The error we are seeing from etcd is:

2018-06-05 21:16:35.868828 C | etcdmain: error validating peerURLs {ClusterID:424b3b1d72a035e9 Members:[&{ID:18602dd0a4a7c400 RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name:etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc ClientURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2379]}} &{ID:5482fd6ce2566e19 RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name:etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv ClientURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2379]}} &{ID:d405711199e936c6 RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name:etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx ClientURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2379]}} &{ID:dfb3691c3b5bce6e RaftAttributes:{PeerURLs:[https://etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh.etcd-2029ddb23ae74610b9af3b4bc87695b2.kubx-etcd-09.svc:2380]} Attributes:{Name: ClientURLs:[]}}] RemovedMemberIDs:[]}: member count is unequal

and the etcd-operator logs show:

time="2018-06-05T21:20:33Z" level=info msg="running members: etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:33Z" level=info msg="cluster membership: etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc,etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:33Z" level=info msg="removing one dead member" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:33Z" level=info msg="removing dead member \"etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh\"" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:38Z" level=info msg="Finish reconciling" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:20:38Z" level=error msg="failed to reconcile: remove member (etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh) failed: context deadline exceeded" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="Start reconciling" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="running members: etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="cluster membership: etcd-2029ddb23ae74610b9af3b4bc87695b2-5xczlcsbtx,etcd-2029ddb23ae74610b9af3b4bc87695b2-6sstmwbtnh,etcd-2029ddb23ae74610b9af3b4bc87695b2-zxxrhmtvjc,etcd-2029ddb23ae74610b9af3b4bc87695b2-2dxgkjv5rv" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=info msg="Finish reconciling" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster
time="2018-06-05T21:21:38Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd-2029ddb23ae74610b9af3b4bc87695b2 pkg=cluster

Something to note: when we are at 2/3, I am able to take a backup of the current state and then restore it, bringing the cluster back into a healthy state. The only issue is that we have downtime while quorum is lost.
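
For reference, the backup taken at 2/3 is just a point-in-time snapshot streamed from one of the still-healthy members. A minimal sketch of that via the clientv3 Maintenance API, assuming a reachable client (endpoint and timeout values are placeholders; the restore itself is done offline with the usual etcd restore tooling):

package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// takeSnapshot streams the backend database from one healthy member and
// writes it to a local file that a restore can later be seeded from.
func takeSnapshot(cli *clientv3.Client, path string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	rc, err := cli.Snapshot(ctx) // Maintenance API: streams a consistent snapshot
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, rc)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-example-client:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if err := takeSnapshot(cli, "snapshot.db"); err != nil {
		log.Fatal(err)
	}
}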

I saw this issue, which may be related to what I am seeing: https://github.com/coreos/etcd-operator/issues/1856

alaypatel07 commented 5 years ago

@kodieGlosser I think the following accurately sums up what's happening with the cluster:

If a 3-member cluster has 1 downed member, it can still make forward progress because the quorum is 2 and 2 members are still live. However, adding a new member to a 3-member cluster increases the quorum to 3, because 3 votes are required for a majority in a 4-member cluster. Since the quorum increased, the extra member buys nothing in terms of fault tolerance; the cluster is still one node failure away from being unrecoverable. A 3-node majority is never achieved, and eventually the operator will render the cluster useless.
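
To make the arithmetic concrete, quorum in Raft is floor(n/2) + 1, so going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure:

package main

import "fmt"

// quorum is the number of votes needed for a Raft majority: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, quorum(n), n-quorum(n))
	}
	// Output:
	// members=3 quorum=2 tolerated failures=1
	// members=4 quorum=3 tolerated failures=1
	// members=5 quorum=3 tolerated failures=2
}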

If you look at the logs, the etcd operator was not able to remove the dead member, and in the third line from the end of the operator log you can see 4 members in the cluster. The thing I am not able to figure out is why the operator is unable to remove the dead member when it still has the quorum of 2 required for the 3-member cluster.
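
For context, the removal the operator reports as "context deadline exceeded" is an ordinary MemberRemove call against the cluster. A minimal sketch of that call via clientv3 (the 5-second timeout, endpoint, and member ID are assumptions here, not the operator's actual values): a membership change has to be committed through Raft, so if the cluster cannot commit it before the timeout, the call returns "context deadline exceeded".

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// removeMember asks the cluster to drop the member with the given ID.
// The membership change must be committed through Raft; if that does not
// happen before the context expires, the call fails with
// "context deadline exceeded", as in the operator log above.
func removeMember(cli *clientv3.Client, id uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := cli.MemberRemove(ctx, id)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-example-client:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	var deadMemberID uint64 = 0xdeadbeef // placeholder: take the real ID from MemberList
	if err := removeMember(cli, deadMemberID); err != nil {
		log.Println("member removal failed:", err)
	}
}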