Quentin-M / etcd-cloud-operator

Deploying and managing production-grade etcd clusters on cloud providers: failure recovery, disaster recovery, backups and resizing.
Apache License 2.0
234 stars 42 forks

Document state machine & algorithms #19

Open Quentin-M opened 5 years ago

ironcladlou commented 3 years ago

As part of evaluating the project, I drew up this state diagram. Curious to get your feedback on it. Really interesting approach to operating the etcd cluster — thanks!

[Image: etcd-cloud-operator state diagram (drawio)]

Quentin-M commented 3 years ago

Hey Dan!

Thanks for making this diagram!

State machines are a beautiful way of managing such systems: they are easy to design, and exhaustive and predictable by nature, which is why I like designing systems this way. Around the time I started working on this project, I was at CoreOS and was chatting with Xiang and the other etcd authors about the overall idea and concepts. The project originated from large production outages at some CoreOS/Tectonic customers who did not have good etcd setups (not self-healing, not enough monitoring, no backups). I also noticed that most other providers and open-source projects relying on etcd lacked these as well, whether because of priorities, laziness, or simply a lack of knowledge and expertise on how to run etcd properly.

Since then, although I do not update the operator frequently, we've been using it extensively at BitMEX (as-is, without modifications) for both Kubernetes and non-Kubernetes production use-cases, mostly on AWS. I hope it can help with your use-cases too.