dperique opened this issue 6 years ago
This is something I've noticed too; it caused a system-wide failure for us when a node that hosted a majority of our etcd pods was rebooted. We then switched over to hard VM instances, since etcd is not something we can afford to lose.
Is there a reason for the operator to not continue spinning up more pods in the event of something like this happening?
It has to do with the Raft consensus protocol that etcd is based on. We can only add or remove a member if there is quorum. Once quorum is lost, we can't add a new member to the cluster, so etcd-operator can't do anything. However, if quorum is re-established because the failed node comes back, then etcd-operator can continue to do its job. The lesson here is to make sure quorum is never lost by running enough members that the failure of a few nodes doesn't cost the cluster its quorum: quorum is floor(n/2)+1, so a 3-member cluster tolerates 1 failure and a 5-member cluster tolerates 2. More details: https://github.com/coreos/etcd/blob/master/Documentation/faq.md#why-an-odd-number-of-cluster-members
I agree. I do think this makes more sense in the hard VM case, where the solution to losing quorum is bringing the downed VM(s) back to life. Kube pods are far more ephemeral - if a pod is terminated it's not possible to ssh in and bring it back online (maybe it is and I'm mistaken). Because of this, I think it's reasonable for etcd-operator to intervene by creating more pods so quorum can again be achieved. Conceptually, this action is the same as a human rescuing downed VMs in the non-Kube world, and it would make the operator more robust. Having said that, I will say etcd-operator is easy to set up and works well 98% of the time. These are my 2 cents, and you undoubtedly have more context than me as a contributor.
If I understand this correctly, the only current solution to this is to delete/recreate the etcd cluster, hopefully from a restore?
Also, does the operator make efforts to spread etcd pods among separate nodes?
> If I understand this correctly, the only current solution to this is to delete/recreate the etcd cluster, hopefully from a restore?
For now, yes, since the etcd operator can't bring a dead pod back to life. However, once PV is supported, we could probably bring up a new pod with the dead pod's PV. Then it might be possible to re-establish quorum, in theory.
> does the operator make efforts to spread etcd pods among separate nodes?
I think the user has to determine how pods are spread across nodes via anti-affinity. I haven't tested this out myself.
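If someone wants to try it, a standard pod anti-affinity stanza along these lines should spread members across nodes; how (or whether) it gets passed through the EtcdCluster spec's pod section is something to check in the docs, and the `app: etcd` label is just illustrative:

```yaml
# Sketch only: a standard Kubernetes podAntiAffinity stanza that schedules
# at most one etcd member per node. How this is plumbed through the
# EtcdCluster CR's pod spec is an assumption -- check the operator docs.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: etcd                        # illustrative label; match your cluster's pod labels
      topologyKey: kubernetes.io/hostname  # one etcd member per node
```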
@fanminshi @davinchia Interesting conversation; here are my 2 cents on the same: once quorum is lost, we could use the backup operator to immediately create a backup and then the restore operator to resurrect the lost cluster. This task can be automated and performed by the cluster operator itself, but it will make the interface complex (all the specs of the backup and restore operators would be required in the etcd CR spec to perform these actions). Instead, a separate operator could be developed with the sole purpose of resurrecting lost clusters. This would solve all the problems and keep the interface as simple as it is now. Thoughts? @hexfusion
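For context, the backup path here is driven by an EtcdBackup CR roughly like the one below - a rough sketch based on the backup operator walkthrough, where the endpoints, S3 path, and secret name are all placeholders:

```yaml
# Rough sketch of an EtcdBackup resource as consumed by the backup operator.
# Endpoint, S3 path, and secret name are placeholders; check the backup
# operator walkthrough for the exact fields supported by your version.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: example-etcd-cluster-backup
spec:
  etcdEndpoints:
  - "http://example-etcd-cluster-client:2379"
  storageType: S3
  s3:
    path: "mybucket/etcd.backup"   # <bucket>/<object key>
    awsSecret: "aws-credentials"   # secret holding AWS credentials/config
```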
@alaypatel07 In general we can restore a cluster from a valid snapshot, so yes, having that as operator logic in response to quorum loss is a valid option. Where it becomes more complicated is if all nodes go down for whatever reason. Because the underlying store is not persistent, we have a gap, and that gap could mean data loss depending on the time between the last snapshot and the cluster loss.
For this reason I am also exploring the use of PV for this failure case, so that the data store (PVC) can persist beyond the life of the pod. This also adds complexity, so there is a trade-off. I like where your head is at though and would like to talk about this further when I get back from KubeCon; thank you for the notes.
@hexfusion we are actually doing exactly that right now by running etcd as a statefulset and configuring the pods to join an existing cluster correctly when quorum is lost.
I’m also at Kubecon so happy to do a quick sync up offline to share this with you.
@dperique I am right outside the showcase if you're around, ping me on k8s slack @hexfusion and we can meet up for a soda :)
@hexfusion I think there are two fundamental problems here and can be addressed separately. One is related to persistent storage as you mentioned and the other is related to restoring the quorum. What I was trying to say is that if a node goes down, even if the data is not persistent, the operator could take a snapshot of the data from nodes that are alive, and restore the quorum by restoring the cluster using that latest snapshot. A PVC can surely create a persistent data store, but I have my doubts as to how it would be helpful in restoring the quorum. Happy to discuss more on it whenever you are back.
> @hexfusion I think there are two fundamental problems here and can be addressed separately. One is related to persistent storage as you mentioned and the other is related to restoring the quorum.
I was saying use the underlying data stores in PVC to restore quorum. So we are talking about the same problem, just different solutions?
> What I was trying to say is that if a node goes down, even if the data is not persistent, the operator could take a snapshot of the data from nodes that are alive, and restore the quorum by restoring the cluster using that latest snapshot.
yes agreed
> A PVC can surely create a persistent data store, but I have my doubts as to how it would be helpful in restoring the quorum. Happy to discuss more on it whenever you are back.
More details are in doc/design/persistent_volumes_etcd_data.md. In short, each pod has a PVC holding its state. In the case of multi-node failure, the PVCs are unmounted and all pods are restarted with the same names/PVCs. In that case etcd will not know anything happened and will operate as expected. Sorry if I was not clear on this before.
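To make it concrete, think of each member pod keeping its data dir on a claim roughly like the one below and getting re-created with the same name and the same claim - names and sizes here are illustrative, not what the design doc prescribes verbatim:

```yaml
# Illustrative only: one PVC per etcd member, holding that member's data dir.
# If the pod dies, a replacement pod with the same name re-mounts the same
# claim and etcd carries on from the persisted WAL/snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-data-example-etcd-cluster-0   # hypothetical per-member name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
```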
@hexfusion That makes more sense, thanks for clarifying.
@davinchia Interested in seeing how you are using Statefulsets, wondering if I could get some more info on that?
@davinchia interested in seeing too
@hexfusion I never made it to Kubecon 2018 -- but it would've been cool to meet.
So I see the solution is: if we lose the whole etcd cluster, just rebuild it. Presumably, if the admin cares, they will probably get alerted (assuming they have some kind of alerting for this condition) and then restore the etcd cluster.
I like this solution, as there's not much you can do when quorum is just lost, and thanks for the PR @manojbadam.
I guess what's left is for me to test and close this.
I cloned the latest etcd-operator and created a 3 member etcd cluster.
I then deleted each member quickly via `kubectl delete po`.
I noticed that etcd-operator didn't re-create the cluster automatically -- I'm not sure if this is expected behavior because I didn't have a backup.
Perhaps etcd-operator will only re-create the etcd cluster if there is a backup.
In my case, I'm creating the etcd cluster using ephemeral storage, so if quorum is lost and I have no backup, there's nothing to do but re-create the etcd cluster. To re-create it, you must delete the EtcdCluster resource and then re-apply the original yaml.
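For reference, the "original yaml" in my case is essentially the example EtcdCluster from the README, with the version pinned to what I deployed:

```yaml
# Example EtcdCluster CR, essentially the one from the etcd-operator README;
# the version here matches my test cluster, adjust to whatever you deployed.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.3.18"
```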
To make matters worse, I observe that the status block of the EtcdCluster resource isn't updated to reflect the degraded condition. For example, I have a cluster that is in this condition, yet its status is:
```yaml
status:
  clientPort: 2379
  conditions:
  - lastTransitionTime: "2020-02-17T23:36:15Z"
    lastUpdateTime: "2020-02-17T23:36:15Z"
    reason: Cluster available
    status: "True"
    type: Available
  currentVersion: 3.3.18
  members:
    ready:
    - etcd-bdd54p7vkj
    - etcd-gcj7ctghs6
    - etcd-qpfj588c8f
  phase: Running
  serviceName: etcd-client
  size: 3
  targetVersion: ""
```
If there's something I must do to at least ensure accurate status, do tell.
Oops, sorry for not replying guys - the StatefulSet solution was a modified version of https://sgotti.dev/post/kubernetes-persistent-etcd/
we did some tweaking for our use case, but the general principle applies.
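To give a flavour of the approach, the shape is roughly the following - heavily trimmed, with illustrative names/images and the etcd clustering flags omitted; sgotti's post has the full working setup:

```yaml
# Heavily trimmed sketch of a persistent etcd StatefulSet; not our exact config.
# Key ideas: stable pod names via the StatefulSet + headless Service, and a
# volumeClaimTemplate so each member keeps its data dir across pod restarts.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd          # headless Service giving each member a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.3.18   # illustrative image/tag
        # etcd flags (--name, --initial-cluster, peer/client URLs, rejoin handling)
        # are omitted here; that wiring is the interesting part of sgotti's post.
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd           # member data dir lives on the PVC
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```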
can confirm it was cool to meet @hexfusion !
I use the example in the README.md file and start up a 3 node etcd cluster. I then `kubectl delete pod` 2 of the 3 example-etcd-cluster pods. The etcd-operator gives up once quorum is lost, and kubectl shows the last remaining etcd pod in Completed state.
To mitigate, I have to delete/re-create the etcd cluster (for non-test setups, folks will also have to restore the etcd database).