Open · technicianted opened this issue 5 years ago
If you don't care about availability, doesn't a simple Deployment with `replicas: 1` do the job?
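For reference, a minimal sketch of what that would look like (names, image, and etcd flags are illustrative; there is deliberately no persistent volume, so the data dies with the pod):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd
spec:
  replicas: 1            # single member: no HA, no durability
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.3.18   # illustrative version
        # real deployments would also set listen/advertise URLs
        command: ["etcd", "--data-dir=/var/lib/etcd"]
        # no volume mount: data is lost whenever the pod restarts
```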
But we do care about availability:
> I would like to see if the operator can be tweaked (with a flag perhaps) to be in high-availability/low-durability mode where the cluster must be up at all times, even if it must be reconstructed.
Right...
Still, it doesn't sound like you lose anything by keeping the defaults, which ensure both HA and durability. Unless your disk is really slow and you'd rather keep everything in memory instead of saving to disk, in which case perhaps Redis is a better choice.
I'm just a third party who stumbled into your issue, not a maintainer, but it seems to me there's not much to be gained by having this option. And more options mean more code paths to test; every option must be justified by its gains.
> I would like to see if the operator can be tweaked (with a flag perhaps) to be in high-availability/low-durability mode where the cluster must be up at all times, even if it must be reconstructed.
@technicianted So are you saying that if the cluster loses quorum, we shouldn't worry about restoring data, we should just recreate a new cluster, and you, the user, will handle data recovery?
In our use cases this may be desirable, yes. Losing the data degrades service for a while, but the data auto-reconstructs over time.
It happened to us during a quay.io outage while applying OS patches to the cluster.
We had to implement another service that watches over the cluster. If the cluster loses quorum, this service patches the CR and restarts the operator, forcing re-bootstrapping, since we needed to keep the Service object. A sketch of that watchdog follows below.
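A minimal sketch of such a watchdog, assuming the operator's `EtcdCluster` CRD (`etcd.database.coreos.com/v1beta2`) and a client Service named `etcd-cluster-client`; the quorum probe and the annotation patch are illustrative stand-ins, not our exact implementation:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// GVR of the etcd-operator's EtcdCluster custom resource.
var etcdClusterGVR = schema.GroupVersionResource{
	Group:    "etcd.database.coreos.com",
	Version:  "v1beta2",
	Resource: "etcdclusters",
}

// hasQuorum probes the cluster with a linearizable read, which
// cannot succeed unless a quorum of members is alive.
func hasQuorum(endpoints []string) bool {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return false
	}
	defer cli.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err = cli.Get(ctx, "quorum-probe")
	return err == nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	endpoints := []string{"etcd-cluster-client:2379"} // assumed Service name

	for range time.Tick(30 * time.Second) {
		if hasQuorum(endpoints) {
			continue
		}
		log.Print("quorum lost; patching CR to force re-bootstrap")
		// Hypothetical patch: bump an annotation so the operator
		// reconciles the cluster from scratch. The real trigger we
		// used is not shown here.
		patch := []byte(`{"metadata":{"annotations":{"example.com/rebootstrap":"` +
			time.Now().Format(time.RFC3339) + `"}}}`)
		_, err := dyn.Resource(etcdClusterGVR).Namespace("default").Patch(
			context.Background(), "etcd-cluster", types.MergePatchType,
			patch, metav1.PatchOptions{})
		if err != nil {
			log.Printf("patch failed: %v", err)
		}
	}
}
```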
Sounds like a use case for metacontroller (https://github.com/GoogleCloudPlatform/metacontroller), but you want it incorporated within etcd-operator itself?
We have a use-case where we use etcd only for distributed coordination. Any data we store in it is purely ephemeral and can be reconstructed by users when it is up again. In fact, we use etcd leases to guarantee that data is always ephemeral by design.
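For context, the lease pattern we rely on is just the standard etcd v3 API; a minimal Go sketch (endpoint and key names are illustrative) where any key bound to a lease is deleted automatically once the lease expires, so nothing outlives its writer:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-cluster-client:2379"}, // assumed endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 30-second lease; keys attached to it vanish when it expires.
	lease, err := cli.Grant(ctx, 30)
	if err != nil {
		log.Fatal(err)
	}

	// Write coordination state bound to the lease: ephemeral by design.
	_, err = cli.Put(ctx, "coord/worker-1", "10.0.0.7:8080", clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// Keep the lease alive while this process is healthy. If the process
	// dies, the lease expires and etcd deletes the key on its own.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for range ch { // drain responses; the channel closes if the lease dies
	}
}
```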
Looking at the code (and design docs), it seems that all the focus has been on data robustness and durability, via [manual] recovery of etcd from backups, or on doing nothing at all (loss of quorum, all members dead, etc.).
I would like to see if the operator can be tweaked (with a flag perhaps) to be in high-availability/low-durability mode where the cluster must be up at all times, even if it must be reconstructed.
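For concreteness, the knob might look something like this on the operator's own Deployment (the flag name is purely hypothetical; nothing like it exists today):

```yaml
# Hypothetical sketch only: this flag does not exist in etcd-operator.
containers:
- name: etcd-operator
  image: quay.io/coreos/etcd-operator:v0.9.4   # illustrative version
  command:
  - etcd-operator
  - --recreate-on-quorum-loss=true   # hypothetical flag
```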
Does anyone else have, or is interested in the same use-case?