dperique opened this issue 6 years ago
This is something I've noticed too; it caused a system-wide failure for us when a node that hosted a majority of our etcd pods was rebooted. We then switched over to hard VM instances, since etcd is not something we can afford to lose.
Is there a reason for the operator to not continue spinning up more pods in the event of something like this happening?
It has to do with the Raft consensus protocol that etcd is based on. We can only add or remove a member if there is quorum. Once quorum is lost, we can't add a new member to the cluster, so etcd-operator can't do anything. However, if quorum is re-established because the failed node comes back, then etcd-operator can continue to do its job. The lesson here is to make sure quorum is never lost by running enough members that the failure of a few nodes doesn't cost the cluster its quorum: quorum is floor(n/2)+1, so a 3-member cluster tolerates 1 failure and a 5-member cluster tolerates 2. More details: https://github.com/coreos/etcd/blob/master/Documentation/faq.md#why-an-odd-number-of-cluster-members
I agree. I do think this makes more sense in the hard VM case, where the solution to losing quorum is bringing the downed VM(s) back to life. Kube pods are far more ephemeral - if a pod is terminated it's not possible to ssh in and bring it back online (maybe it is and I'm mistaken). Because of this, I think it's reasonable for etcd-operator to intervene by creating more pods so quorum can again be achieved. Conceptually, this action is the same as a human rescuing downed VMs in the non-Kube world, and it would make the operator more robust. Having said that, I will say etcd-operator is easy to set up and works well 98% of the time. These are my 2 cents, and you undoubtedly have more context than me as a contributor.
If I understand this correctly, the only current solution to this is to delete/recreate the etcd cluster, hopefully from a restore?
Also, does the operator make efforts to spread etcd pods among separate nodes?
> If I understand this correctly, the only current solution to this is to delete/recreate the etcd cluster, hopefully from a restore?
For now, yes, since the etcd operator can't bring a dead pod back to life. However, once PV is supported, we could probably bring up a new pod with the dead pod's PV. Then it might be possible to re-establish quorum, in theory.
> does the operator make efforts to spread etcd pods among separate nodes?
I think the user has to determine how pods are spread across nodes via anti-affinity. I haven't tested this out myself.
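If someone wants to try it, a standard pod anti-affinity stanza along these lines should spread members across nodes; how (or whether) it gets passed through the EtcdCluster spec's pod section is something to check in the docs, and the `app: etcd` label is just illustrative:

```yaml
# Sketch only: a standard Kubernetes podAntiAffinity stanza that schedules
# at most one etcd member per node. How this is plumbed through the
# EtcdCluster CR's pod spec is an assumption -- check the operator docs.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: etcd                        # illustrative label; match your cluster's pod labels
      topologyKey: kubernetes.io/hostname  # one etcd member per node
```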
@fanminshi @davinchia Interesting conversation; here are my 2 cents on the same: once quorum is lost, we could use the backup operator to immediately create a backup and then the restore operator to resurrect the lost cluster. This task can be automated and performed by the cluster operator itself, but it will make the interface complex (all the specs of the backup and restore operators would be required in the etcd CR spec to perform these actions). Instead, a separate operator could be developed with the sole purpose of resurrecting lost clusters. This would solve all the problems and keep the interface as simple as it is now. Thoughts? @hexfusion
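For context, the backup path here is driven by an EtcdBackup CR roughly like the one below - a rough sketch based on the backup operator walkthrough, where the endpoints, S3 path, and secret name are all placeholders:

```yaml
# Rough sketch of an EtcdBackup resource as consumed by the backup operator.
# Endpoint, S3 path, and secret name are placeholders; check the backup
# operator walkthrough for the exact fields supported by your version.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: example-etcd-cluster-backup
spec:
  etcdEndpoints:
  - "http://example-etcd-cluster-client:2379"
  storageType: S3
  s3:
    path: "mybucket/etcd.backup"   # <bucket>/<object key>
    awsSecret: "aws-credentials"   # secret holding AWS credentials/config
```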
@alaypatel07 In general we can restore a cluster from a valid snapshot, so yes, having that as operator logic in response to quorum loss is a valid option. Where it becomes more complicated is if all nodes go down for whatever reason. Because the underlying store is not persistent, we have a gap, and that gap could mean data loss depending on the time between the last snapshot and the cluster loss.
For this reason I am also exploring the use of PV for this failure case, so that the data store (PVC) can persist beyond the life of the pod. This also adds complexity, so there is a trade-off. I like where your head is at though and would like to talk about this further when I get back from KubeCon; thank you for the notes.
@hexfusion we are actually doing exactly that right now by running etcd as a statefulset and configuring the pods to join an existing cluster correctly when quorum is lost.
I’m also at Kubecon so happy to do a quick sync up offline to share this with you.
@dperique I am right outside the showcase if you're around, ping me on k8s slack @hexfusion and we can meet up for a soda :)
@hexfusion I think there are two fundamental problems here and can be addressed separately. One is related to persistent storage as you mentioned and the other is related to restoring the quorum. What I was trying to say is that if a node goes down, even if the data is not persistent, the operator could take a snapshot of the data from nodes that are alive, and restore the quorum by restoring the cluster using that latest snapshot. A PVC can surely create a persistent data store, but I have my doubts as to how it would be helpful in restoring the quorum. Happy to discuss more on it whenever you are back.
> @hexfusion I think there are two fundamental problems here and can be addressed separately. One is related to persistent storage as you mentioned and the other is related to restoring the quorum.
I was saying use the underlying data stores in PVC to restore quorum. So we are talking about the same problem, just different solutions?
> What I was trying to say is that if a node goes down, even if the data is not persistent, the operator could take a snapshot of the data from nodes that are alive, and restore the quorum by restoring the cluster using that latest snapshot.
yes agreed
> A PVC can surely create a persistent data store, but I have my doubts as to how it would be helpful in restoring the quorum. Happy to discuss more on it whenever you are back.
More details are in doc/design/persistent_volumes_etcd_data.md. In short, each pod has a PVC holding its state. In the case of multi-node failure, the PVCs are unmounted and all pods are restarted with the same names/PVCs. In that case etcd will not know anything happened and will operate as expected. Sorry if I was not clear on this before.
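To make it concrete, think of each member pod keeping its data dir on a claim roughly like the one below and getting re-created with the same name and the same claim - names and sizes here are illustrative, not what the design doc prescribes verbatim:

```yaml
# Illustrative only: one PVC per etcd member, holding that member's data dir.
# If the pod dies, a replacement pod with the same name re-mounts the same
# claim and etcd carries on from the persisted WAL/snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-data-example-etcd-cluster-0   # hypothetical per-member name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
```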
@hexfusion That makes more sense, thanks for clarifying.
@davinchia Interested in seeing how you are using Statefulsets, wondering if I could get some more info on that?
@davinchia interested in seeing too
@hexfusion I never made it to Kubecon 2018 -- but it would've been cool to meet.
So I see the solution is: if we lose the whole etcd cluster, just rebuild it. Presumably, if the admin cares, they will probably get alerted (assuming they have some kind of alerting for this condition) and then restore the etcd cluster.
I like this solution, as there's not much you can do when quorum is just lost, and thanks for the PR @manojbadam.
I guess what's left is for me to test and close this.
I cloned the latest etcd-operator and created a 3 member etcd cluster.
I then deleted each member quickly via `kubectl delete po`.
I noticed that etcd-operator didn't re-create the cluster automatically -- I'm not sure if this is expected behavior because I didn't have a backup.
Perhaps etcd-operator will only re-create the etcd cluster if there is a backup.
In my case, I'm creating the etcd cluster using ephemeral storage, so if quorum is lost and I have no backup, there's nothing to do but re-create the etcd cluster. To re-create it, you must delete the EtcdCluster resource and then re-apply the original yaml.
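For reference, the "original yaml" in my case is essentially the example EtcdCluster from the README, with the version pinned to what I deployed:

```yaml
# Example EtcdCluster CR, essentially the one from the etcd-operator README;
# the version here matches my test cluster, adjust to whatever you deployed.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "example-etcd-cluster"
spec:
  size: 3
  version: "3.3.18"
```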
To make matters worse, I observe that the status block of the EtcdCluster resource isn't updated to reflect the degraded condition. For example, I have a cluster that is in this condition, yet its status is:
```yaml
status:
  clientPort: 2379
  conditions:
  - lastTransitionTime: "2020-02-17T23:36:15Z"
    lastUpdateTime: "2020-02-17T23:36:15Z"
    reason: Cluster available
    status: "True"
    type: Available
  currentVersion: 3.3.18
  members:
    ready:
    - etcd-bdd54p7vkj
    - etcd-gcj7ctghs6
    - etcd-qpfj588c8f
  phase: Running
  serviceName: etcd-client
  size: 3
  targetVersion: ""
```
If there's something I must do to at least ensure accurate status, do tell.
Oops, sorry for not replying guys - the StatefulSet solution was a modified version of https://sgotti.dev/post/kubernetes-persistent-etcd/
we did some tweaking for our use case, but the general principle applies.
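To give a flavour of the approach, the shape is roughly the following - heavily trimmed, with illustrative names/images and the etcd clustering flags omitted; sgotti's post has the full working setup:

```yaml
# Heavily trimmed sketch of a persistent etcd StatefulSet; not our exact config.
# Key ideas: stable pod names via the StatefulSet + headless Service, and a
# volumeClaimTemplate so each member keeps its data dir across pod restarts.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd          # headless Service giving each member a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.3.18   # illustrative image/tag
        # etcd flags (--name, --initial-cluster, peer/client URLs, rejoin handling)
        # are omitted here; that wiring is the interesting part of sgotti's post.
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd           # member data dir lives on the PVC
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```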
can confirm it was cool to meet @hexfusion !
I use the example in the README.md file and start up a 3 node etcd cluster. I then `kubectl delete pod` 2 of the 3 example-etcd-cluster pods. The etcd-operator gives up once quorum is lost, and kubectl shows the last remaining etcd pod in Completed state.
To mitigate, I have to delete/re-create the etcd cluster (for non-test setups, folks will also have to restore the etcd database).