Quentin-M / etcd-cloud-operator

Deploying and managing production-grade etcd clusters on cloud providers: failure recovery, disaster recovery, backups and resizing.
Apache License 2.0

Cluster lost quorum and didn't recover automatically. #45

Closed by empath75 4 years ago

empath75 commented 5 years ago

health check for peer 1db1c9009069daa could not connect: dial tcp 10.194.214.86:2380: i/o timeout
2019-11-04 19:15:41.141040 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2019-11-04 19:15:41.141076 W | etcdserver: not enough started members, rejecting member add {ID:9d2e3521595e019 RaftAttributes:{PeerURLs:[https://10.194.211.163:2380]} Attributes:{Name: ClientURLs:[]}}

A node terminated (the one with the timeout errors) and was replaced by the autoscaling group (the replacement is the one that was rejected), but the replacement couldn't join. The only way I was able to fix it was by stopping the ECO container on all the instances at the same time.

Is this something that should have been corrected automatically or was there another way to recover it?
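
For readers following along, the rejection in the log can be reproduced with a little arithmetic. The sketch below is a simplified model of the check implied by the log message, not etcd's actual code: the function names are made up for illustration, and any special cases etcd may apply are ignored. It assumes the cluster had 3 members, of which 2 were still started.

```python
def quorum(cluster_size: int) -> int:
    """Majority needed for a cluster of the given size."""
    return cluster_size // 2 + 1


def can_add_member(total_members: int, started_members: int) -> bool:
    """Simplified model (hypothetical helper): the add is accepted only if the
    members already started still reach quorum of the enlarged cluster."""
    return started_members >= quorum(total_members + 1)


# Dead member still registered: 3 members, 2 started.
# Adding a 4th raises quorum to 3, which 2 started members cannot meet.
print(can_add_member(total_members=3, started_members=2))  # False

# Dead member removed first: 2 members, 2 started.
# Adding a 3rd needs quorum 2, which is met.
print(can_add_member(total_members=2, started_members=2))  # True
```

Under this model, the replacement can only join once the dead member has been removed from the member list.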

Quentin-M commented 4 years ago

Sorry for not being able to respond here sooner. That error message is expected. There is a configurable TTL, after which the dead instance is removed from the quorum, allowing your new instance to join successfully. The default is "30s" at the project level and "3min" on AWS, which gives AWS EC2 instances enough time to potentially restart and re-join.
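
As a rough illustration of the TTL behaviour described above, here is a minimal sketch of how an operator might decide which members to evict. It is not the etcd-cloud-operator implementation: `members_to_remove`, the `last_healthy` map, the `member-b` and `member-c` IDs, and the timestamps are all hypothetical. Only the peer ID `1db1c9009069daa` and the "30s" and "3min" values come from this thread.

```python
from datetime import datetime, timedelta


def members_to_remove(last_healthy: dict, now: datetime, ttl: timedelta) -> list:
    """Return IDs of members whose last successful health check is older than the TTL.
    Hypothetical helper, for illustration only."""
    return sorted(m for m, seen in last_healthy.items() if now - seen > ttl)


now = datetime(2019, 11, 4, 19, 15, 41)
last_healthy = {
    "1db1c9009069daa": now - timedelta(minutes=2),  # the unreachable peer
    "member-b": now - timedelta(seconds=5),         # hypothetical healthy peer
    "member-c": now - timedelta(seconds=3),         # hypothetical healthy peer
}

# With the project-level default, the dead peer is already past its TTL.
print(members_to_remove(last_healthy, now, timedelta(seconds=30)))  # ['1db1c9009069daa']

# With the AWS default, it still has a minute left to come back.
print(members_to_remove(last_healthy, now, timedelta(minutes=3)))   # []
```

Until the TTL expires, the dead member still counts toward the cluster size, so add requests for the replacement keep being rejected.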