Etcd Members don't have a persistent storage identity

yanniszark commented 5 years ago

Steps to reproduce:

Create an EtcdCluster that uses PersistentVolumes

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
name: "example-etcd-cluster"
spec:
size: 3
pod:
persistentVolumeClaimSpec:
  storageClassName: default
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

Delete a member of the cluster: kubectl delete <pod_name>
Etcd operator creates a new Pod to replace the deleted one

Video Footage: https://www.youtube.com/watch?v=pYyqDs1piG8&feature=youtu.be

What I expect to happen: The new member should get the PersistentVolume of the deleted one.

What happens: The new member got a new, empty PersistentVolume.

Why this is a problem: Pod deletes happen in Kubernetes even without user error. For example, draining a node will cause Pods on the node to be deleted. New members of etcd should get the PV of an old member, if it exists. StatefulSet already handles this.

Arau commented 5 years ago

Same here, I'd like to be able to recover the cluster without restoring from a backup (sometimes not possible to restore valid data) even with manual intervention. I could achieve that with proper persistent storage.

Would it be ok to start a PR using Statefulsets or at least build reclaim logic from a PVC?

yanniszark commented 5 years ago

A PR that swaps the current custom controller with StatefulSets would probably require a major refactor of the codebase. However, StatefulSets have come a long way and I feel they are a good fit for running etcd.

This operator was developed during the very early Kubernetes days and had to implement a lot of things that are now included out-of-the-box with StatefulSets. Extending the current code to handle this scenario could be done, but it is something that StatefulSets already do successfully and is well-tested.

I believe it would be a better solution to start clean and leverage StatefulSets as well as a modern framework for writing operators, like kubebuilder or operator-sdk. It would greatly simplify the current codebase and make it easier for new people to contribute.

I'd love to hear some opinions on this from contributors! :smile: @hexfusion @hongchaodeng @xiang90

raoofm commented 5 years ago

I agree with @yanniszark. Also I see most of the organizations are anyways relying on StatefulSets features anyhow with custom bootstrapper and teardown scripts.

hexfusion commented 5 years ago

/cc @alaypatel07

hexfusion commented 5 years ago

I believe it would be a better solution to start clean and leverage StatefulSets as well as a modern framework for writing operators, like kubebuilder or operator-sdk. It would greatly simplify the current codebase and make it easier for new people to contribute.

I think that this is really the question, at which point do we move to SDK framework?

yanniszark commented 5 years ago

It's true that I'm mixing up 2 orthogonal things here:

The issue is that deleting a Pod results in the loss of its PV, which should not happen.
The proposed solution is to use StatefulSets, to solve the problem and simplify the code. This solution requires major refactoring.
A new operator-framework would greatly simplify the code and offer new features (dynamic client, OpenAPI validation, admission webhooks). This requires major refactoring.

Since both require major refactoring and both simplify the code greatly, I'm kind of grouping them together, which may be kinda wrong.

alaypatel07 commented 5 years ago

@hexfusion I think now would be a good time move to SDK. Especially because, as @yanniszark pointed out, it would give us an opportunity to simplify the codebase.

@yanniszark @raoofm I have developed a proof of concept etcd operator using the operatorSDK's --type=Ansible to check any issues around using StatefulSets for managing etcd. Please feel free to check it out, here, and share thoughts. Thanks

yanniszark commented 5 years ago

I'm not too sure about using Ansible. This would present an extra barrier for existing contributors since they would also have to learn how Ansible works. In addition, most Operators out there are written in Go, etcd is written in Go and because of that I suspect that the CoreOS team working on etcd is made of people working with Go primarily.

Because of the aforementioned reasons, I think an implementation in Go makes more sense for this project.

alaypatel07 commented 5 years ago

@yanniszark I am sorry if it sounded like I was suggesting to refactor this project in Ansible, it would be an insane thing to do. I prototyped the POC in Ansible, to figure out the corner cases of using the StatefulSet resource.

Refactoring this project with SDK and StatefulSet will be a lot of work. Just wanted to have a quick jab at the pitfalls of the approach for folks to try out before a similar implementation for this project can be planned.

yanniszark commented 5 years ago

@alaypatel07 Ahh makes much more sense now! That's great, please do share your findings! :smile: Did you encounter any major roadblocks?

alaypatel07 commented 5 years ago

For some reason, I cannot get readinessProbes to work. My understanding is that for a 3 pod cluster, the StatefulSet controller only rolls out pod-1 if pod-0 is ready and pod-0's readinessProbe will not succeed until all the three pods are up and have formed a quorum, leading to a deadlock. I could be wrong, feel free to try it out.
The startup script is a little unreadable IMO, but it can be easily taken out and implemented as an initContainer. The logic embedded(hopefully written in go) in the initContainers can be tested and hence, it is a predictable way of doing things.
Initially, I used the preStop hook to remove the pod as an etcd member before deleting, but I noticed that on kubectl delete pods -l app=etcd all the pods will be removed from the etcd cluster. Following that, even if the StatefulSet controller re-creates the pods, the cluster can be in an unpredictable state. Hence, I think removing the member from the operator is a better way than using preStop hook. It will keep the cluster stable even on an accidental delete of pods as well as pods being rescheduled (because of node draining for instance).

Overall, I would say there isn't a real roadblock(up to now) that would hamper the use of StatefulSet. Having said that, the refactor is a huge task, we need to be careful and think this one through. I am still trying to figure out other major issues like deploying cluster using TLS certs, making Backup and Restore work, etc. Once I feel confident, I would go ahead and sum up the key learnings in a proposal.

BlueBlue-Lee commented 5 years ago

Make pod name stable like statefulset is the key point.

coreos / etcd-operator

Etcd Members don't have a persistent storage identity #2049