coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

[RFC] Have etcd-operator manage completely ephemeral etcd clusters #2032

Open technicianted opened 5 years ago

technicianted commented 5 years ago

We have a use-case where we use etcd only for distributed coordination. Any data we store in it is purely ephemeral and can be reconstructed by users when it is up again. In fact, we use etcd leases to guarantee that data is always ephemeral by design.
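The "ephemeral by design" guarantee comes from etcd leases: every key is attached to a lease, and if the client stops refreshing the lease, the key disappears when the TTL lapses. A minimal stdlib-only sketch of that semantics (the real etcd API is the Go client's `Grant`, `Put ... WithLease`, and `KeepAlive` calls; `LeaseStore` and the key names here are illustrative, not etcd code):

```python
import time

class LeaseStore:
    """Toy in-memory model of etcd's lease-scoped keys: each key is
    attached to a lease, and once the lease TTL lapses the key is gone."""

    def __init__(self):
        self._leases = {}   # lease_id -> expiry timestamp
        self._keys = {}     # key -> (value, lease_id)
        self._next_id = 1

    def grant(self, ttl_seconds):
        """Create a lease with the given TTL and return its id."""
        lease_id = self._next_id
        self._next_id += 1
        self._leases[lease_id] = time.monotonic() + ttl_seconds
        return lease_id

    def put(self, key, value, lease_id):
        """Store a key scoped to a lease."""
        self._keys[key] = (value, lease_id)

    def keep_alive(self, lease_id, ttl_seconds):
        """Refresh a lease, as a live client would via KeepAlive."""
        self._leases[lease_id] = time.monotonic() + ttl_seconds

    def get(self, key):
        """Return the value, or None if its lease has expired."""
        entry = self._keys.get(key)
        if entry is None:
            return None
        value, lease_id = entry
        if time.monotonic() >= self._leases.get(lease_id, 0):
            del self._keys[key]   # lease expired: key vanishes
            return None
        return value

store = LeaseStore()
lease = store.grant(ttl_seconds=0.1)
store.put("coord/worker-1", "alive", lease)
print(store.get("coord/worker-1"))  # alive: the lease still holds
time.sleep(0.15)                    # stop refreshing; let the lease lapse
print(store.get("coord/worker-1"))  # None: the data was ephemeral by design
```

Because every key lives only as long as some client keeps its lease alive, a full cluster rebuild loses nothing that the clients cannot repopulate.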

Looking at the code (and design docs), it seems that all the focus has been on data robustness and durability: either [manual] recovery of etcd from backups, or doing nothing at all (loss of quorum, totally dead members, etc.).

I would like to see if the operator can be tweaked (with a flag perhaps) to be in high-availability/low-durability mode where the cluster must be up at all times, even if it must be reconstructed.

Does anyone else have, or is interested in, the same use-case?

gjcarneiro commented 5 years ago

If you don't care about availability, doesn't a simple Deployment with replicas: 1 do the job?
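A minimal sketch of the single-replica approach suggested here, assuming a plain etcd container exposing the client port (image tag, names, and URLs are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-single
spec:
  replicas: 1
  selector:
    matchLabels: { app: etcd-single }
  template:
    metadata:
      labels: { app: etcd-single }
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.3
          command:
            - etcd
            - --listen-client-urls=http://0.0.0.0:2379
            - --advertise-client-urls=http://etcd-single:2379
          ports:
            - containerPort: 2379
```

This gives restart-on-failure but not availability: there is downtime while the single pod is rescheduled, which is exactly the gap the next reply points out.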

technicianted commented 5 years ago

But we do care about availability:

> I would like to see if the operator can be tweaked (with a flag perhaps) to be in high-availability/low-durability mode where the cluster must be up at all times, even if it must be reconstructed.

gjcarneiro commented 5 years ago

Right...

Still, it doesn't sound like you lose anything by keeping the defaults, which ensure both HA and durability. Unless your disk is really slow and you want to keep everything in memory rather than persist it, in which case perhaps Redis is a better choice.

I'm just a third party, not a maintainer; I just stumbled into your issue. But it seems to me there's not much to be gained from this option. More options mean more code paths to test, and every option must be justified by its gains.

hexfusion commented 5 years ago

> I would like to see if the operator can be tweaked (with a flag perhaps) to be in high-availability/low-durability mode where the cluster must be up at all times, even if it must be reconstructed.

@technicianted So are you saying that if the cluster loses quorum, the operator shouldn't worry about restoring data but should just recreate a new cluster, and you, the user, will handle data recovery?

technicianted commented 5 years ago

In our use cases this may be desirable, yes. Losing the data degrades service for a while, but the state eventually auto-reconstructs over time.

It happened to us during the quay.io outage, while OS patches were being applied to the cluster.

We had to implement another service that watches over the cluster. If the cluster loses quorum, the service patches the CR and restarts the operator, forcing a re-bootstrap, since we needed to keep the Service object.
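The decision that watchdog has to make reduces to Raft arithmetic: a cluster of n members needs a majority, floor(n/2) + 1, of healthy members to make progress, and once fewer than that remain, no amount of waiting restores quorum. A small sketch of that detection logic (the actual CR-patching and operator-restart steps described above are omitted; `needs_rebootstrap` is a hypothetical name, not part of etcd-operator):

```python
def quorum(cluster_size):
    """Raft majority: a cluster of n members needs floor(n/2) + 1 votes."""
    return cluster_size // 2 + 1

def needs_rebootstrap(cluster_size, healthy_members):
    """True once too few members remain healthy to ever regain quorum.

    This is the point at which the watchdog described above would patch
    the CR and restart the operator to force a fresh bootstrap."""
    return healthy_members < quorum(cluster_size)

# A 3-member cluster tolerates one failure but not two:
print(needs_rebootstrap(3, 2))  # False: 2 healthy >= quorum of 2
print(needs_rebootstrap(3, 1))  # True: quorum is permanently lost
```

Note the asymmetry this thread is about: the operator can detect the second case, but by default it has no safe action to take, whereas an ephemeral-data deployment would be happy with an automatic rebuild.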

nvtkaszpir commented 5 years ago

Sounds like a use case for metacontroller (https://github.com/GoogleCloudPlatform/metacontroller), but you want it incorporated within etcd-operator itself?