Cluster lost quorum and didn't recover automatically.

Quentin-M / etcd-cloud-operator

Deploying and managing production-grade etcd clusters on cloud providers: failure recovery, disaster recovery, backups and resizing.

Apache License 2.0


health check for peer 1db1c9009069daa could not connect: dial tcp 10.194.214.86:2380: i/o timeout
2019-11-04 19:15:41.141040 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2019-11-04 19:15:41.141076 W | etcdserver: not enough started members, rejecting member add {ID:9d2e3521595e019 RaftAttributes:{PeerURLs:[https://10.194.211.163:2380]} Attributes:{Name: ClientURLs:[]}}

A node terminated (the peer with the timeout errors, 10.194.214.86) and was replaced by the autoscaling group (the member whose add request was rejected, 10.194.211.163), but the replacement couldn't join. The only way I was able to fix it was to stop the eco (etcd-cloud-operator) container on all the instances at the same time.
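For reference, the numbers in the rejection message (2 started members, quorum of 3) are consistent with the following arithmetic. This is only a sketch of what I assume the check does: the dead member still counts as registered, so adding the replacement would make a 4-member cluster that needs 3 started members. The function names `quorum` and `canAddMember` are made up for illustration and are not etcd's API.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int {
	return n/2 + 1
}

// canAddMember is a hypothetical reconstruction of the check implied by the
// log: the add is allowed only if the members that are already started would
// still form a quorum of the cluster once the new member is registered.
// registered counts every member in the membership list, including dead ones.
func canAddMember(registered, started int) bool {
	// Assumed special case: growing a healthy single-member cluster to two.
	if registered == 1 && started == 1 {
		return true
	}
	return started >= quorum(registered+1)
}

func main() {
	// Situation from the log: 3 registered members, 1 terminated, 2 started.
	registered, started := 3, 2
	fmt.Printf("quorum after add: %d, started: %d, allowed: %v\n",
		quorum(registered+1), started, canAddMember(registered, started))
}
```

Under the same assumptions, `canAddMember(2, 2)` returns true, which corresponds to the case where the terminated member has been removed from the membership list before the replacement is added.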

Is this something that should have been corrected automatically, or was there another way to recover the cluster?


Cluster lost quorum and didn't recover automatically. #45