coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

rack / failure domain awareness for total disaster recovery #645

Open heyitsanthony opened 7 years ago

heyitsanthony commented 7 years ago

The etcd docs suggest running an odd number of nodes because an even membership does not improve tolerance of node failure. However, an even number of nodes can guarantee complete recovery across two failure domains. If one domain is destroyed, quorum is permanently lost, but provided the members are split evenly between the two domains, every quorum (a majority of all members) must include at least one member from each domain. The surviving domain is therefore guaranteed to have at least one member with complete knowledge of the committed cluster state before the failure, so the cluster can be rebuilt from that domain without losing a single committed proposal.
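The quorum argument can be checked by brute force. The following is a minimal sketch, not anything taken from etcd or etcd-operator: the function names, the rack names, and the member names are all hypothetical, and it assumes a plain majority quorum of `n // 2 + 1`.

```python
from itertools import combinations
from typing import Dict, List


def quorum(n_members: int) -> int:
    """Majority quorum size for a cluster of n_members."""
    return n_members // 2 + 1


def every_quorum_spans_all_domains(domains: Dict[str, List[str]]) -> bool:
    """Return True if every quorum-sized set of members contains at least
    one member from each failure domain (checked exhaustively)."""
    members = [m for ms in domains.values() for m in ms]
    domain_of = {m: d for d, ms in domains.items() for m in ms}
    all_domains = set(domains)
    for group in combinations(members, quorum(len(members))):
        if {domain_of[m] for m in group} != all_domains:
            return False
    return True


# Hypothetical layouts: four members split evenly, and three members split 2/1.
even_split = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"]}
odd_split = {"rack-a": ["a1", "a2"], "rack-b": ["b1"]}

print(every_quorum_spans_all_domains(even_split))
print(every_quorum_spans_all_domains(odd_split))
```

With the even split, a quorum needs three of four members, so it cannot fit inside a single rack. With the 2/1 split, `a1` and `a2` alone form a quorum, so a proposal could be committed without ever reaching `rack-b`.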

This kind of failure recovery is tedious to do by hand, but not terribly complex; software can do it with ease.

xiang90 commented 7 years ago

Interesting idea. etcd operator is not rack- or DC-aware right now. We should explore this idea when we get there.