etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

etcdctl cluster-health and member list commands do not work correctly #2711

Closed kelseyhightower closed 9 years ago

kelseyhightower commented 9 years ago

etcd server version

/opt/bin/etcd --version
etcd version 2.0.9

etcd client version

/usr/local/bin/etcdctl --version
etcdctl version 2.0.9

Start a 3-node etcd cluster

vmrun list
Total running VMs: 3
/Users/kelseyhightower/Documents/Virtual Machines.localized/core0.vmwarevm/core0.vmx
/Users/kelseyhightower/Documents/Virtual Machines.localized/core1.vmwarevm/core1.vmx
/Users/kelseyhightower/Documents/Virtual Machines.localized/core2.vmwarevm/core2.vmx
etcdctl cluster-health
cluster is healthy
member 5ae3067007f7fb85 is healthy
member 7931e79c0d8b47c5 is healthy
member 987146e8925f10e5 is healthy
etcdctl member list
5ae3067007f7fb85: name=etcd2 peerURLs=http://192.168.12.52:2380 clientURLs=http://192.168.12.52:2379
7931e79c0d8b47c5: name=etcd0 peerURLs=http://192.168.12.50:2380 clientURLs=http://192.168.12.50:2379
987146e8925f10e5: name=etcd1 peerURLs=http://192.168.12.51:2380 clientURLs=http://192.168.12.51:2379

Power off one of the etcd members

vmrun stop /Users/kelseyhightower/Documents/Virtual\ Machines.localized/core0.vmwarevm/core0.vmx
vmrun list
Total running VMs: 2
/Users/kelseyhightower/Documents/Virtual Machines.localized/core1.vmwarevm/core1.vmx
/Users/kelseyhightower/Documents/Virtual Machines.localized/core2.vmwarevm/core2.vmx

The member list command fails

etcdctl -C http://192.168.12.50:2379,http://192.168.12.51:2379,http://192.168.12.52:2379 member list
context deadline exceeded

The cluster is reported healthy, but no nodes are marked unhealthy even though member 7931e79c0d8b47c5 is powered off.

etcdctl -C http://192.168.12.50:2379,http://192.168.12.51:2379,http://192.168.12.52:2379 cluster-health
cluster is healthy
member 5ae3067007f7fb85 is healthy
member 7931e79c0d8b47c5 is healthy
member 987146e8925f10e5 is healthy

yichengq commented 9 years ago

This is basically the same issue we hit in 2.0.3: https://github.com/coreos/etcd/issues/2340

mariusgrigaitis commented 9 years ago

Can confirm this.

durzo commented 9 years ago

Running v2.1.0-alpha.1, cluster-health still reports all members as healthy after a node is shut down.

2015/06/24 09:29:30 etcdserver: failed to reach the peerURL(http://etcd2:7001) of member 7a9767de17ea4500 (Get http://etcd2:7001/version: net/http: request canceled while waiting for connection)

root@etcd3:~$ etcdctl cluster-health
cluster is healthy
member 7a9767de17ea4500 is healthy
member cb1f485859524c11 is healthy
member d555fc8f72be9146 is healthy

xiang90 commented 9 years ago

/cc @yichengq

yichengq commented 9 years ago

@durzo Did it eventually change to unhealthy? How long did you see this false healthy status?

So far, we know that the implementation has a delay (on the order of minutes) before the health status reflects a hard-killed machine, and we plan to improve this in 2.2. The internal reason is that etcd 2.0 sends MsgApp asynchronously over an HTTP stream, which cannot reveal whether the receiving side is actually working.
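
While the reported status lags, an operator could cross-check each member independently of what the cluster says. The following is only a sketch of that idea, not something etcd or etcdctl provides: it parses the v2-style `member list` output shown earlier in this thread and tries a plain TCP connection to each client URL. The function names (`parse_member_list`, `tcp_reachable`, `probe_members`) are made up for illustration, and a successful TCP connect only shows that the port accepts connections, not that the member is a working part of the cluster.

```python
import socket
from urllib.parse import urlparse


def parse_member_list(text):
    """Parse v2-style `etcdctl member list` output into {member_id: [client_urls]}.

    Assumes lines of the form shown in this thread:
    <id>: name=<name> peerURLs=<urls> clientURLs=<urls>
    """
    members = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ": " not in line:
            continue  # skip blank lines and anything that is not a member row
        member_id, rest = line.split(": ", 1)
        fields = dict(f.split("=", 1) for f in rest.split() if "=" in f)
        urls = [u for u in fields.get("clientURLs", "").split(",") if u]
        members[member_id] = urls
    return members


def tcp_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_members(members, probe=tcp_reachable):
    """Map each member ID to True if any of its client URLs passes the probe."""
    return {mid: any(probe(u) for u in urls) for mid, urls in members.items()}
```

Passing a different `probe` callable (for example one that issues an HTTP request) keeps the parsing separate from the check itself.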

kerk1v commented 6 years ago

Hello,

It seems the behaviour is still the same in etcd v3.2.15, so a cluster operator has no way to manually confirm the health of an etcd v3 cluster. Any ideas that would help here, or is there an alternative to 'etcdctl member list' for checking cluster health in etcd3?

Edited:

It seems

ETCDCTL_API=3 etcdctl --cert=/etc/etcd_k8s/etcd.pem --key /etc/etcd_k8s/etcd-key.pem --insecure-skip-tls-verify=true --endpoints=[https://master-1:2379,https://master-2:2370,https://master-3:2379] endpoint health

will check each of the nodes and report

https://master-3:2379 is healthy: successfully committed proposal: took = 3.665702ms
https://master-1:2379 is healthy: successfully committed proposal: took = 3.202865ms
https://master-2:2370 is unhealthy: failed to connect: dial tcp 192.168.33.102:2370: getsockopt: no route to host
Error:  unhealthy cluster

when one of the etcd cluster members is down. However, this requires the operator to have at least some knowledge of the etcd cluster, since the endpoints must be listed explicitly.
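
To turn the per-endpoint report above into something a script can act on, the output can be parsed line by line. This is a minimal sketch that assumes the exact line format quoted in this comment (`<endpoint> is healthy: ...` / `<endpoint> is unhealthy: ...`); other etcdctl versions may print something different, and the helper names are hypothetical.

```python
import re

# Matches lines such as:
#   https://master-3:2379 is healthy: successfully committed proposal: took = 3.665702ms
#   https://master-2:2370 is unhealthy: failed to connect: ...
HEALTH_LINE = re.compile(
    r"^(?P<endpoint>\S+) is (?P<state>healthy|unhealthy): (?P<detail>.*)$"
)


def parse_endpoint_health(text):
    """Return {endpoint: (is_healthy, detail)} for each recognised line."""
    results = {}
    for line in text.splitlines():
        match = HEALTH_LINE.match(line.strip())
        if match:
            healthy = match.group("state") == "healthy"
            results[match.group("endpoint")] = (healthy, match.group("detail"))
    return results


def unhealthy_endpoints(text):
    """Return the sorted list of endpoints reported as unhealthy."""
    parsed = parse_endpoint_health(text)
    return sorted(ep for ep, (healthy, _) in parsed.items() if not healthy)
```

Lines that do not match the pattern, such as the trailing `Error:  unhealthy cluster`, are ignored rather than treated as endpoints.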

dxlr8r commented 6 years ago

I have the same issue with:

etcdctl version: 3.2.22
API version: 3.2

I get "unhealthy cluster" as well when using "etcdctl member list", even though 2 of 3 members are online.

I did, however, notice that when going from 2/3 to 1/3 there was no leader:

[foo1@bar ~]# etcdctl3 endpoint status
Failed to get the status of endpoint https://foo1:2379 (context deadline exceeded)
Failed to get the status of endpoint https://foo2:2379 (context deadline exceeded)
https://foo3:2379, 6074b97ec42826bg, 3.2.22, 16 MB, false, 760, 20792785

Note the false in the last line.
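
The missing leader is what a majority quorum would predict: a cluster of n members needs n // 2 + 1 of them to elect a leader, so a 3-member cluster keeps working with 2 members but not with 1. A small sketch of that arithmetic follows, together with a parser for the status line above. The column order (endpoint, ID, version, DB size, is-leader, raft term, raft index) is an assumption inferred from the output in this comment, and the function names are illustrative only.

```python
def quorum(cluster_size):
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size, reachable):
    """True if the reachable members can still form a majority."""
    return reachable >= quorum(cluster_size)


def parse_status_line(line):
    """Split one comma-separated `endpoint status` line into named fields.

    Column order is assumed from the output quoted in this thread.
    """
    names = ["endpoint", "id", "version", "db_size", "is_leader",
             "raft_term", "raft_index"]
    values = [field.strip() for field in line.split(",")]
    if len(values) != len(names):
        raise ValueError("unexpected number of columns: %d" % len(values))
    status = dict(zip(names, values))
    status["is_leader"] = status["is_leader"] == "true"
    status["raft_term"] = int(status["raft_term"])
    status["raft_index"] = int(status["raft_index"])
    return status
```

For the 3-member cluster in this comment, `has_quorum(3, 2)` holds and `has_quorum(3, 1)` does not, which matches the behaviour described when going from 2/3 to 1/3.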

knraju483 commented 4 years ago

Use the command below for etcdctl v3.3.x:

etcdctl --endpoints=https://192.168.56.113:2379,https://192.168.56.118:2379,https://192.168.56.119:2379 --key-file="/etc/kubernetes/pki/etcd/client-key.pem" --cert-file="/etc/kubernetes/pki/etcd/client.pem" --ca-file="/etc/kubernetes/pki/etcd/ca.pem" member list -w table

Use the command below for etcdctl v3.4.7:

etcdctl --endpoints=https://192.168.56.113:2379,https://192.168.56.118:2379,https://192.168.56.119:2379 --key="/etc/kubernetes/pki/etcd/client-key.pem" --cert="/etc/kubernetes/pki/etcd/client.pem" --cacert="/etc/kubernetes/pki/etcd/ca.pem" member list -w table
+------------------+---------+----------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |   NAME   |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+----------+-----------------------------+-----------------------------+------------+
| 29338f91dec951c0 | started | master01 | https://192.168.56.113:2380 | https://192.168.56.113:2379 |      false |
| 438679b543748ad8 | started | master02 | https://192.168.56.118:2380 | https://192.168.56.118:2379 |      false |
| 48544942dc6b8509 | started | master03 | https://192.168.56.119:2380 | https://192.168.56.119:2379 |      false |
+------------------+---------+----------+-----------------------------+-----------------------------+------------+

bu3ny commented 4 years ago

I'm using the latest release of etcd at the time of writing this comment (etcd-v3.4.9), and the following command works for me:

[root@master01 ~]#  etcdctl --endpoints=https://192.168.122.101:2379,https://192.168.122.102:2379,https://192.168.122.103:2379   --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem member list -w table
+------------------+---------+----------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |   NAME   |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+----------+------------------------------+------------------------------+------------+
| 148f9f6172465414 | started | master02 | https://192.168.122.102:2380 | https://192.168.122.102:2379 |      false |
| 79ad015295a746a9 | started | master01 | https://192.168.122.101:2380 | https://192.168.122.101:2379 |      false |
| f857eddf41ed1741 | started | master03 | https://192.168.122.103:2380 | https://192.168.122.103:2379 |      false |
+------------------+---------+----------+------------------------------+------------------------------+------------+