Closed gnuoy closed 5 years ago
In my setup each etcd unit has an individual server certificate that includes an IP SAN with the ip address of the unit.
If I generate a new certificate that has IP SAN entries for all units in the cluster and then use then use that certificate on all units the problem goes away. I don't like this workaround though as it means I need to cert every time I scale out.
In my current setup I have three etcd nodes: 10.53.82.141, 10.53.82.217 and 10.53.82.233 I have shutdown etcd on 10.53.82.141 and 10.53.82.141 is the first entry in my Endpoints list. Running the client and capturing the error message from https://github.com/etcd-io/etcd/blob/master/vendor/google.golang.org/grpc/credentials/credentials.go#L166 gives:
{"level":"info","ts":1536056536.818413,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bo4mljc94cxt","addresses":["https://10.53.82.141:2379","https://10.53.82.217:2379","https://10.53.82.233:2379"]}
{"level":"info","ts":1536056536.8184912,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185225,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185527,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8217986,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
x509: certificate is valid for 10.53.82.233, not 10.53.82.141
{"level":"info","ts":1536056536.8290203,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
x509: certificate is valid for 10.53.82.217, not 10.53.82.141
{"level":"info","ts":1536056536.8310082,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}```
It looks like the chain of events (when servername is not used):
1) newClient is called from clientv3/client.go. This first endpoint is extracted from the Endpoint list and used to call client.dial.
func newClient(cfg *Config) (*Client, error) {
...
dialEndpoint := cfg.Endpoints[0]
...
conn, err := client.dial(dialEndpoint, grpc.WithBalancerName(roundRobinBalancerName))
2) client.dial in turn calls DialContext from ./vendor/google.golang.org/grpc/clientconn.go
func (c *Client) dial(ep string, dopts ...grpc.DialOption) (*grpc.ClientConn, error) {
...
conn, err := grpc.DialContext(dctx, target, opts...)
3) DialContext sets the authority to be url for the first endpoint ./vendor/google.golang.org/grpc/clientconn.go
func DialContext(ctx context.Context, target string, opts ...DialOption) (conn *ClientConn, err error) {
...
cc := &ClientConn{
target: target,
csMgr: &connectivityStateManager{},
conns: make(map[*addrConn]struct{}),
blockingpicker: newPickerWrapper(),
}
...
creds := cc.dopts.copts.TransportCredentials
if creds != nil && creds.Info().ServerName != "" {
cc.authority = creds.Info().ServerName
} else if cc.dopts.insecure && cc.dopts.copts.Authority != "" {
cc.authority = cc.dopts.copts.Authority
} else {
// Use endpoint from "scheme://authority/endpoint" as the default
// authority for ClientConn.
cc.authority = cc.parsedTarget.Endpoint
}
...
4) Connection attempts to any endpoint other than the first now fail because their IP does not match the IP of the first endpoint. ./vendor/google.golang.org/grpc/credentials/credentials.go
func (c *tlsCreds) ClientHandshake(ctx context.Context, authority string, rawConn net.Conn) (_ net.Conn, _ AuthInfo, err error) {
// use local cfg to avoid clobbering ServerName if using multiple endpoints
cfg := cloneTLSConfig(c.config)
if cfg.ServerName == "" {
colonPos := strings.LastIndex(authority, ":")
if colonPos == -1 {
colonPos = len(authority)
}
cfg.ServerName = authority[:colonPos]
Experiencing the same issue - however with SRV discovery which will use FQDNs for the endpoint list not IP addresses.
Similarly only the first endpoint will ever successfully be balanced.
My client is Hashicorp vault also, so related to https://github.com/hashicorp/vault/issues/4961.
Using env ETCD_CLIENT_DEBUG=on GRPC_GO_LOG_VERBOSITY_LEVEL=INFO
has helped debug.
Highlights from the log lines follow...
All endpoints are successfully enumerated:
{"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}
However only a single endpoint will result in a READY
subconn (etcd3
in this case):
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}
It will generally pick a different etcd[1,2,3]
server each time I restart vault. Just depends on the order the SRV record is returned.
As per the TLS advice for SRV discovery - the SRV domain, fqdn and IP address. Ditto for etcd[2,3]
servers.
Subject: CN = etcd1.nodes.example.com
X509v3 Subject Alternative Name:
DNS:etcd1.nodes.example.com, DNS:*.etcd.example.com, DNS:etcd.example.com, IP Address:10.xx.xx.xx
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5091121,"caller":"balancer/balancer.go:39","msg":"registered balancer","policy":"etcd-client-roundrobin-balanced","name":"etcd-etcd-client-roundrobin-balanced"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5548,"caller":"balancer/balancer.go:82","msg":"built balancer","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","resolver-target":"endpoint://client-bqhf6xf64qh7/etcd3.nodes.example.com.:2379"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5549786,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550096,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550244,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.572446,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5726738,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.573283,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5733666,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com.:2379 (0xc000868b90)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5749798,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com:2379","https://etcd2.nodes.example.com:2379","https://etcd3.nodes.example.com:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5750618,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd3.nodes.example.com.:2379","subconn":"0xc000868b90"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575098,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd2.nodes.example.com.:2379","subconn":"0xc000868bb0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57512,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd1.nodes.example.com.:2379","subconn":"0xc000868bd0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751421,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575167,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751889,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57521,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"READY","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5752327,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575254,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575295,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: ==> Vault server configuration:
Dec 13 02:30:40 runc[31079]: Api Address: https://etcd2.nodes.example.com:8200
Dec 13 02:30:40 runc[31079]: Cgo: disabled
Dec 13 02:30:40 runc[31079]: Cluster Address: https://etcd2.nodes.example.com:8201
Dec 13 02:30:40 runc[31079]: Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Dec 13 02:30:40 runc[31079]: Log Level: (not set)
Dec 13 02:30:40 runc[31079]: Mlock: supported: true, enabled: true
Dec 13 02:30:40 runc[31079]: Storage: etcd (HA available)
Dec 13 02:30:40 runc[31079]: Version: Vault v0.11.5
Dec 13 02:30:40 runc[31079]: Version Sha: a59ffa4a0f09bbf198241fe6793a96722789b639
Dec 13 02:30:40 runc[31079]: ==> Vault server started! Log data will stream in below:
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.575287,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5754151,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5860226,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dmic-16-dp68-h1-09irec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5864868,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO] core: vault is unsealed
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO] core: entering standby mode
The code tracing done by @gnuoy is illuminating in that the actual sequence of events is:
ClientHandshake
, the winner will set the ServerName
on the TLS config, e.g. etcd1.example.com
ServerName
So I can only conclude that using the same authority for all subconnections in the balancer is fundamentally flawed when used with TLS.
Under the current implementation there are 3 possible solutions:
*.example.com
as a SAN to each member - not great/subscribe
/cc @gyuho @jpbetz
@jsok Is the TLS config in etcd client side or in the gRPC side? Can we switch to use a fresh config if the endpoint is different from the previous through balancer? Can you take look if we can fix this problem from etcd client side?
@xiang90 @jpbetz I can reproduce this. Let me see if I can fix this in etcd client side.
We're experiencing exactly the same issue as @gnuoy :
One certificate per server IP address and etcdv3 client fails if first endpoint unavailable.
@gyuho I am also running into the same issue - From the comments above it looks you have been working on a patch. Any update on it.
@gyuho - It's difficult to decipher if this has been fixed or if it's still pending now as there are about a dozen cross-referenced issues and PRs.
I looked at the status of this last week. I believe this is still pending on https://github.com/etcd-io/etcd/pull/10489. I could be wrong though.
Any updates on this? We're still running into this with Kubernetes v1.14.1 and etcd 3.3.12
We should really get an update on this for k8s - v1.15
Just discussed with gRPC team, and got some good feedback. I will rework on this in the next few weeks.
Any updates on this? We're still running into this with Kubernetes v1.15.0 and etcd 3.3.13
FWIW, we work around this problem by placing TCP reverse proxies on each node that connect to etcd. Each client can connect etcd via localhost:12379. Since TLS certificates of etcd servers have "localhost" SAN and "127.0.0.1" IP-SAN, the problem can be avoided.
A possible better workaround would be to place TLS-terminating TCP reverse proxies. That is, it terminates TLS both for client connections and for etcd servers while validating server certificates with their public IP addresses.
The fix https://github.com/etcd-io/etcd/pull/10911 has been merged, and will be released in 3.4 https://github.com/etcd-io/etcd/issues/10943.
@bahar-p @Protopopys ... I was just wondering the same thing. Question still is ... will Kubernetes / kubeadm or other installer tools move etcd 3.4 (when released) or will etcd do a backport for etcd 3.3.xx and the tools just got to jump the point release.
See my question at https://github.com/etcd-io/etcd/pull/10911#issuecomment-515735841
@gyuho I tried this code on my staging environment and I still face the same issue:
ETCD_CLIENT_DEBUG=1 ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints 10.10.10.242:2379,10.10.10.252:2379,10.10.10.139:2379 --cacert /etc/ssl/etcd-root-ca.pem --cert /etc/ssl/client.pem --key /etc/ssl/client-key.pem put foo bar
{"level":"debug","ts":"2019-08-07T22:54:09.070Z","caller":"balancer/balancer.go:60","msg":"registered balancer","policy":"picker-roundrobin-balanced","name":"etcd-picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:09.074Z","caller":"balancer/balancer.go:97","msg":"built balancer","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","resolver-target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:148","msg":"resolved","picker":"picker-error","balancer-id":"bw3rmsvwfygd","addresses":["10.10.10.139:2379","10.10.10.242:2379","10.10.10.252:2379"]}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.252:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.139:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.098Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.602Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.603Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"warn","ts":"2019-08-07T22:54:14.074Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.10.10.242:2379: connect: connection refused\""}
10.10.10.242
is down on purpose.
And from the same machine I can check the health of those etcd nodes just fine:
root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.252:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.252:2379 is healthy: successfully committed proposal: took = 18.506072ms
root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.139:2379 is healthy: successfully committed proposal: took = 16.068855ms
So I'm pretty confused here. I double checked to make sure I was working off a branch with your changes and I'm...
root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem --key=/etc/ssl/client-key.pem --debug endpoint status
[...]
https://10.10.10.139:2379, 146b0b16a4192d16, 3.5.0-pre, 20 kB, false, false, 48, 71, 71,
~/go/src/github.com/etcd-io/etcd (release-3.4)$ make build
GO_BUILD_FLAGS="-v" ./build
./bin/etcd --version
etcd Version: 3.5.0-pre
Git SHA: 524278c18
Go Version: go1.12
Go OS/Arch: linux/amd64
./bin/etcdctl version
etcdctl version: 3.5.0-pre
API version: 3.5
(sorry for commenting a closed ticket)
Describe the bug The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable.
To Reproduce I setup an etcd cluster with http (port 2378) and https (port 2379) listeners. Then used the etcd go client library to query the cluster. Then I took down the first unit listed when the client was established (in my case the client on port 10.53.82.119). The http go client continues to work but the https one fails.
https client:
Fails with:
2018-07-21 11:34:05.728613 I | context deadline exceeded
http client: