Closed kelseyhightower closed 9 years ago
This is basically the same issue we hit in 2.0.3: https://github.com/coreos/etcd/issues/2340
Can confirm this.
Running v2.1.0-alpha.1: after shutting down a node, cluster-health still reports all members as healthy.
2015/06/24 09:29:30 etcdserver: failed to reach the peerURL(http://etcd2:7001) of member 7a9767de17ea4500 (Get http://etcd2:7001/version: net/http: request canceled while waiting for connection)
root@etcd3:~$ etcdctl cluster-health
cluster is healthy
member 7a9767de17ea4500 is healthy
member cb1f485859524c11 is healthy
member d555fc8f72be9146 is healthy
/cc @yichengq
@durzo Did it eventually change to unhealthy? How long did you see this false healthy status?
So far, we know that the implementation has some delay (on the order of minutes) in updating the health status of a hard-killed machine, and we plan to improve it in 2.2. The internal detail is that etcd 2.0 sends MsgApp asynchronously over an HTTP stream, so a successful send does not tell us whether the receiving side is working.
Hello,
It seems the behaviour is still the same in etcd v3.2.15, so a cluster operator has no way to manually confirm the health of an etcd v3 cluster. Any ideas to help with this, or is there an alternative etcdctl command for checking cluster health in etcd3 other than 'etcdctl member list'?
Edited:
It seems
ETCDCTL_API=3 etcdctl --cert=/etc/etcd_k8s/etcd.pem --key /etc/etcd_k8s/etcd-key.pem --insecure-skip-tls-verify=true --endpoints=[https://master-1:2379,https://master-2:2370,https://master-3:2379] endpoint health
will check each of the nodes and report
https://master-3:2379 is healthy: successfully committed proposal: took = 3.665702ms
https://master-1:2379 is healthy: successfully committed proposal: took = 3.202865ms
https://master-2:2370 is unhealthy: failed to connect: dial tcp 192.168.33.102:2370: getsockopt: no route to host
Error: unhealthy cluster
in case one of the etcd cluster members is down. However, this requires the operator to have at least some knowledge of the etcd cluster.
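For anyone who wants to script this check, here is a minimal sketch that tallies the `endpoint health` lines shown above. It only parses text on stdin. The function name `summarize_health` and the "quorum"/"no-quorum" wording are my own, not part of etcdctl, and the verdict is only meaningful if the endpoint list covers every cluster member.

```shell
#!/bin/sh
# Summarize `etcdctl endpoint health` output read from stdin.
# Prints "<healthy> <unhealthy> <verdict>", where verdict is "quorum" when a
# strict majority of the listed endpoints reported healthy, else "no-quorum".
# Assumption: lines look like the ones quoted in this thread
# ("<endpoint> is healthy: ..." / "<endpoint> is unhealthy: ...").
summarize_health() {
    awk '
        / is healthy/   { healthy++ }
        / is unhealthy/ { unhealthy++ }
        END {
            total = healthy + unhealthy
            verdict = (total > 0 && healthy * 2 > total) ? "quorum" : "no-quorum"
            printf "%d %d %s\n", healthy, unhealthy, verdict
        }
    '
}

# Hypothetical usage (same flags as the command above). stderr is merged in
# case your etcdctl version prints the health lines there:
#   ETCDCTL_API=3 etcdctl ... endpoint health 2>&1 | summarize_health
```

With the three lines of output quoted above, this should print `2 1 quorum`: one member is down, but the remaining two still form a majority.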
I have the same issue with:
etcdctl version: 3.2.22
API version: 3.2
I get "unhealthy cluster" as well when using "etcdctl member list", even though 2/3 are online.
I did however notice that when going from 2/3 to 1/3 there was no leader:
[foo1@bar ~]# etcdctl3 endpoint status
Failed to get the status of endpoint https://foo1:2379 (context deadline exceeded)
Failed to get the status of endpoint https://foo2:2379 (context deadline exceeded)
https://foo3:2379, 6074b97ec42826bg, 3.2.22, 16 MB, false, 760, 20792785
Note the false in the last line.
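The missing leader is what majority quorum predicts: a 3-member cluster needs 2 members to elect a leader, so a single survivor cannot. A small sketch of that arithmetic, plus a check for the leader flag in the output above. The function names are made up for illustration, and I am assuming the fifth comma-separated field of the simple `endpoint status` format is the is-leader flag (the `false` pointed out above).

```shell
#!/bin/sh
# Smallest number of members that forms a strict majority of $1 members.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Reads simple-format `endpoint status` output on stdin and succeeds (exit 0)
# only if some endpoint reports itself as leader.
# Assumption: field 5 is the is-leader flag, as in the line quoted above.
# "Failed to get the status ..." lines contain no commas and are ignored.
has_leader() {
    awk -F', *' '$5 == "true" { found = 1 } END { exit !found }'
}

# Hypothetical usage (etcdctl3 is the poster's own wrapper/alias):
#   etcdctl3 endpoint status 2>&1 | has_leader || echo "no leader visible"
```

`quorum_size 3` gives 2 and `quorum_size 5` gives 3. Note that `has_leader` only sees the endpoints you query, so a leader on an endpoint that is not in the list would be missed.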
Use the command below for etcdctl v3.3.XX:
etcdctl --endpoints=https://192.168.56.113:2379,https://192.168.56.118:2379,https://192.168.56.119:2379 --key-file="/etc/kubernetes/pki/etcd/client-key.pem" --cert-file="/etc/kubernetes/pki/etcd/client.pem" --ca-file="/etc/kubernetes/pki/etcd/ca.pem" member list -w table
etcdctl --endpoints=https://192.168.56.113:2379,https://192.168.56.118:2379,https://192.168.56.119:2379 --key="/etc/kubernetes/pki/etcd/client-key.pem" --cert="/etc/kubernetes/pki/etcd/client.pem" --cacert="/etc/kubernetes/pki/etcd/ca.pem" member list -w table
+------------------+---------+----------+-----------------------------+-----------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+----------+-----------------------------+-----------------------------+------------+
| 29338f91dec951c0 | started | master01 | https://192.168.56.113:2380 | https://192.168.56.113:2379 | false |
| 438679b543748ad8 | started | master02 | https://192.168.56.118:2380 | https://192.168.56.118:2379 | false |
| 48544942dc6b8509 | started | master03 | https://192.168.56.119:2380 | https://192.168.56.119:2379 | false |
+------------------+---------+----------+-----------------------------+-----------------------------+------------+
I'm using the latest release of etcd at the time of writing this comment (etcd-v3.4.9), and the following command works for me:
[root@master01 ~]# etcdctl --endpoints=https://192.168.122.101:2379,https://192.168.122.102:2379,https://192.168.122.103:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem member list -w table
+------------------+---------+----------+------------------------------+------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+----------+------------------------------+------------------------------+------------+
| 148f9f6172465414 | started | master02 | https://192.168.122.102:2380 | https://192.168.122.102:2379 | false |
| 79ad015295a746a9 | started | master01 | https://192.168.122.101:2380 | https://192.168.122.101:2379 | false |
| f857eddf41ed1741 | started | master03 | https://192.168.122.103:2380 | https://192.168.122.103:2379 | false |
+------------------+---------+----------+------------------------------+------------------------------+------------+
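Going by the reports earlier in this thread, a `started` row in `member list` does not by itself show that the member is reachable, so it may help to feed the CLIENT ADDRS column into `endpoint health`. A rough sketch that pulls those URLs out of the table output. `client_endpoints` is a hypothetical helper name, and it assumes the six-column layout shown above (older releases without the IS LEARNER column still have CLIENT ADDRS as the fifth column).

```shell
#!/bin/sh
# Reads `etcdctl member list -w table` output on stdin and prints the
# CLIENT ADDRS values as one comma-separated list for --endpoints.
# Assumption: CLIENT ADDRS is the fifth column of the table.
client_endpoints() {
    awk -F'|' '
        /^\|/ {
            # Splitting on "|" leaves field 1 empty, so column N is field N+1.
            id = $2; gsub(/ /, "", id)
            if (id == "ID") next          # skip the header row
            url = $6; gsub(/ /, "", url)
            if (url != "") urls = (urls == "" ? url : urls "," url)
        }
        END { print urls }
    '
}

# Hypothetical usage (TLS flags omitted):
#   eps=$(etcdctl ... member list -w table | client_endpoints)
#   etcdctl --endpoints="$eps" ... endpoint health
```

Newer etcdctl releases also appear to offer a `--cluster` option on the endpoint commands that does this lookup for you; check `etcdctl endpoint health --help` on your version before relying on it.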
etcd server version
etcd client version
Start a 3 node etcd cluster
Power off one of the etcd members
The member list command fails
The cluster is reported healthy, but no nodes are marked unhealthy even though member 7931e79c0d8b47c5 is powered off.