etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

etcdctl member remove fails with #13629

Closed squ94wk closed 2 years ago

squ94wk commented 2 years ago

Problem

The member list shows 12 members, of which only 4 are valid and online. There were 5 valid members before: I stopped one in order to replace it and then removed it, which succeeded. When I try to add it back, I get an error saying the cluster is unhealthy. When I try to remove the invalid members, etcdctl says the member wasn't found.

ubuntu@etcd1-k8s:~$ etcdctl member list
b26aceda033c826, started, etcd2-k8s, https://192.168.0.71:2380, https://192.168.0.71:2379, false
c6372107d0777d2, started, etcd3-k8s, https://192.168.0.34:2380, https://192.168.0.34:2379, false
18ada2f5a3b3fdf3, started, etcd05-restore, https://192.168.0.26:2380, https://192.168.0.26:2379, false
24aef85f917105ee, started, etcd4-k8s, https://192.168.0.146:2380, https://192.168.0.146:2379, false
25a17d0cf397ff2a, started, etcd0-k8s, https://192.168.0.5:2380, https://192.168.0.5:2379, false
3ae8cf204f312062, started, etcd1-k8s, https://192.168.0.117:2380, https://192.168.0.117:2379, false
9c27af115d153949, started, etcd02-restore, https://192.168.0.28:2380, https://192.168.0.28:2379, false
affd478dfebabc4b, started, etcd01-restore, https://192.168.0.25:2380, https://192.168.0.25:2379, false
b3a1c8694f025b7f, started, etcd2-k8s, https://192.168.0.10:2380, https://192.168.0.10:2379, false
b3e330dca330e585, started, etcd04-restore, https://192.168.0.22:2380, https://192.168.0.22:2379, false
e4c0813c987fdb4c, started, etcd03-restore, https://192.168.0.21:2380, https://192.168.0.21:2379, false
f7d18cee4c463ffd, started, etcd1-k8s, https://192.168.0.4:2380, https://192.168.0.4:2379, false
ubuntu@etcd1-k8s:~$ etcdctl member add etcd0-k8s --peer-urls=https://192.168.0.121:2380
{"level":"warn","ts":"2022-01-20T14:06:21.599+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003b0380/#initially=[https://192.168.0.117:2379]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Error: etcdserver: unhealthy cluster
ubuntu@etcd1-k8s:~$ etcdctl member remove 25a17d0cf397ff2a
{"level":"warn","ts":"2022-01-20T14:01:38.466+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000422380/#initially=[https://192.168.0.117:2379]","attempt":0,"error":"rpc error: code = NotFound desc = etcdserver: member not found"}
Error: etcdserver: member not found

I understand that we cannot force the removal of a member, and there doesn't seem to be a key in etcd that we could delete to remove the member "manually". I don't know where the "-restore" members come from; the other invalid entries look as if the machines' IPs have changed.
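
To make the split between the duplicated names and the "-restore" entries easier to see, the listing above can be grouped by member name. This is a minimal sketch, not part of etcd: it assumes the comma-separated layout shown in the transcript (ID, status, name, peer URL, client URL, learner flag) with exactly one URL per field, and the helper names `parse_members` and `group_by_name` are hypothetical.

```python
from collections import defaultdict

# Member list copied verbatim from the transcript above.
MEMBER_LIST = """\
b26aceda033c826, started, etcd2-k8s, https://192.168.0.71:2380, https://192.168.0.71:2379, false
c6372107d0777d2, started, etcd3-k8s, https://192.168.0.34:2380, https://192.168.0.34:2379, false
18ada2f5a3b3fdf3, started, etcd05-restore, https://192.168.0.26:2380, https://192.168.0.26:2379, false
24aef85f917105ee, started, etcd4-k8s, https://192.168.0.146:2380, https://192.168.0.146:2379, false
25a17d0cf397ff2a, started, etcd0-k8s, https://192.168.0.5:2380, https://192.168.0.5:2379, false
3ae8cf204f312062, started, etcd1-k8s, https://192.168.0.117:2380, https://192.168.0.117:2379, false
9c27af115d153949, started, etcd02-restore, https://192.168.0.28:2380, https://192.168.0.28:2379, false
affd478dfebabc4b, started, etcd01-restore, https://192.168.0.25:2380, https://192.168.0.25:2379, false
b3a1c8694f025b7f, started, etcd2-k8s, https://192.168.0.10:2380, https://192.168.0.10:2379, false
b3e330dca330e585, started, etcd04-restore, https://192.168.0.22:2380, https://192.168.0.22:2379, false
e4c0813c987fdb4c, started, etcd03-restore, https://192.168.0.21:2380, https://192.168.0.21:2379, false
f7d18cee4c463ffd, started, etcd1-k8s, https://192.168.0.4:2380, https://192.168.0.4:2379, false
"""


def parse_members(text):
    """Parse the listing into dicts; assumes one URL per field (no extra commas)."""
    members = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 6:
            continue  # skip anything that is not a six-column member row
        member_id, status, name, peer_url, client_url, learner = fields
        members.append({
            "id": member_id,
            "status": status,
            "name": name,
            "peer_url": peer_url,
            "client_url": client_url,
            "learner": learner == "true",
        })
    return members


def group_by_name(members):
    """Map each member name to the list of entries carrying that name."""
    groups = defaultdict(list)
    for member in members:
        groups[member["name"]].append(member)
    return groups


members = parse_members(MEMBER_LIST)

# Names listed more than once, with the peer URL of each entry.
duplicates = {
    name: [m["peer_url"] for m in entries]
    for name, entries in group_by_name(members).items()
    if len(entries) > 1
}

# Entries whose name ends in "-restore".
restore = [m["name"] for m in members if m["name"].endswith("-restore")]

for name, urls in sorted(duplicates.items()):
    print(f"{name}: {', '.join(urls)}")
print(f"{len(restore)} '-restore' entries out of {len(members)} members")
```

On this listing, the grouping finds two names that appear twice with different IPs (etcd1-k8s and etcd2-k8s) and five "-restore" entries. It only sorts the output; it cannot tell which entry of a duplicated pair is the live one.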

Other than this, the cluster seems healthy, even with only 4 of 5 members online.
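
For context on the 4/5 figure: a Raft-based store such as etcd needs a majority of its voting members to make progress. The sketch below only works through that arithmetic for the two membership sizes that appear in this report. Whether the server counts the 12 listed entries or the 5 expected ones is not established here, so the second case is an assumption for illustration. The function name `quorum` is hypothetical.

```python
def quorum(n_members):
    """Smallest majority of n_members voting members."""
    return n_members // 2 + 1


# 5 is the expected cluster size; 12 is the number of entries in the member list.
for n_members in (5, 12):
    needed = quorum(n_members)
    online = 4  # members reported as valid and online
    print(f"{n_members} members: quorum {needed}, "
          f"{online} online is {'enough' if online >= needed else 'not enough'}")
```

With 5 members, a majority is 3, so 4 online members are enough. If all 12 listed entries were counted, a majority would be 7.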

We don't know how to reproduce the issue, but we can see it in a handful of our etcd clusters. Is there a way to remove the members without restoring a new cluster from a snapshot?

Additional information

ubuntu@etcd1-k8s:~$ etcd --version
etcd Version: 3.5.0
Git SHA: 946a5a6f2
Go Version: go1.16.3
Go OS/Arch: linux/amd64

serathius commented 2 years ago

It's a known issue in v3.5.0; please upgrade to v3.5.1.

serathius commented 2 years ago

https://github.com/etcd-io/etcd/issues/13196

squ94wk commented 2 years ago

Thank you very much for the swift reply.