etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.98k stars 9.79k forks source link

etcd go client fails when querying a cluster with a down node #9949

Closed gnuoy closed 5 years ago

gnuoy commented 6 years ago

Describe the bug The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable.

To Reproduce I setup an etcd cluster with http (port 2378) and https (port 2379) listeners. Then used the etcd go client library to query the cluster. Then I took down the first unit listed when the client was established (in my case the client on port 10.53.82.119). The http go client continues to work but the https one fails.

https client:

    ctx, _ := context.WithTimeout(context.Background(), requestTimeout)

    cfg := clientv3.Config{
        Endpoints:   []string{"https://10.53.82.119:2379" ,"https://10.53.82.150:2379", "https://10.53.82.157:2379"},
        DialTimeout: 5 * time.Second,
    }

    cert := "/home/liam/tls_vault_certs/etcd-cert.pem"
    key := "/home/liam/tls_vault_certs/etcd.key"
    ca := "/home/liam/tls_vault_certs/etcd-ca.pem"
    tls := transport.TLSInfo{
        TrustedCAFile: ca,
        CertFile:      cert,
        KeyFile:       key,
    }

    tlscfg, err := tls.ClientConfig()
    cfg.TLS = tlscfg

    cli, err := clientv3.New(cfg)
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()
    kv := clientv3.NewKV(cli)

Fails with: 2018-07-21 11:34:05.728613 I | context deadline exceeded

http client:

        ctx, _ := context.WithTimeout(context.Background(), requestTimeout)

        cfg := clientv3.Config{
                Endpoints:   []string{"http://10.53.82.119:2378" ,"http://10.53.82.150:2378", "http://10.53.82.157:2378"},
                DialTimeout: 5 * time.Second,
        }

        cli, err := clientv3.New(cfg)
        if err != nil {
                log.Fatal(err)
        }
        defer cli.Close()
        kv := clientv3.NewKV(cli)
gnuoy commented 6 years ago

In my setup each etcd unit has an individual server certificate that includes an IP SAN with the ip address of the unit.

If I generate a new certificate that has IP SAN entries for all units in the cluster and then use then use that certificate on all units the problem goes away. I don't like this workaround though as it means I need to cert every time I scale out.

gnuoy commented 6 years ago

In my current setup I have three etcd nodes: 10.53.82.141, 10.53.82.217 and 10.53.82.233 I have shutdown etcd on 10.53.82.141 and 10.53.82.141 is the first entry in my Endpoints list. Running the client and capturing the error message from https://github.com/etcd-io/etcd/blob/master/vendor/google.golang.org/grpc/credentials/credentials.go#L166 gives:


{"level":"info","ts":1536056536.818413,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bo4mljc94cxt","addresses":["https://10.53.82.141:2379","https://10.53.82.217:2379","https://10.53.82.233:2379"]}
{"level":"info","ts":1536056536.8184912,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185225,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185527,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8217986,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}

x509: certificate is valid for 10.53.82.233, not 10.53.82.141
{"level":"info","ts":1536056536.8290203,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}

x509: certificate is valid for 10.53.82.217, not 10.53.82.141
{"level":"info","ts":1536056536.8310082,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}```
gnuoy commented 6 years ago

It looks like the chain of events (when servername is not used):

1) newClient is called from clientv3/client.go. This first endpoint is extracted from the Endpoint list and used to call client.dial.

func newClient(cfg *Config) (*Client, error) {
...
    dialEndpoint := cfg.Endpoints[0]
...
    conn, err := client.dial(dialEndpoint, grpc.WithBalancerName(roundRobinBalancerName))

2) client.dial in turn calls DialContext from ./vendor/google.golang.org/grpc/clientconn.go

func (c *Client) dial(ep string, dopts ...grpc.DialOption) (*grpc.ClientConn, error) {
...
    conn, err := grpc.DialContext(dctx, target, opts...)

3) DialContext sets the authority to be url for the first endpoint ./vendor/google.golang.org/grpc/clientconn.go

func DialContext(ctx context.Context, target string, opts ...DialOption) (conn *ClientConn, err error) {
...
        cc := &ClientConn{
                target: target,
                csMgr:  &connectivityStateManager{},
                conns:  make(map[*addrConn]struct{}),
                blockingpicker: newPickerWrapper(),
        }
...
        creds := cc.dopts.copts.TransportCredentials
        if creds != nil && creds.Info().ServerName != "" {
                cc.authority = creds.Info().ServerName
        } else if cc.dopts.insecure && cc.dopts.copts.Authority != "" {
                cc.authority = cc.dopts.copts.Authority
        } else {
                // Use endpoint from "scheme://authority/endpoint" as the default
                // authority for ClientConn.
                cc.authority = cc.parsedTarget.Endpoint
        }
...

4) Connection attempts to any endpoint other than the first now fail because their IP does not match the IP of the first endpoint. ./vendor/google.golang.org/grpc/credentials/credentials.go

func (c *tlsCreds) ClientHandshake(ctx context.Context, authority string, rawConn net.Conn) (_ net.Conn, _ AuthInfo, err error) {
    // use local cfg to avoid clobbering ServerName if using multiple endpoints
    cfg := cloneTLSConfig(c.config)
    if cfg.ServerName == "" {
        colonPos := strings.LastIndex(authority, ":")
        if colonPos == -1 {
            colonPos = len(authority)
        }
        cfg.ServerName = authority[:colonPos]
jsok commented 5 years ago

Experiencing the same issue - however with SRV discovery which will use FQDNs for the endpoint list not IP addresses.

Similarly only the first endpoint will ever successfully be balanced.

My client is Hashicorp vault also, so related to https://github.com/hashicorp/vault/issues/4961. Using env ETCD_CLIENT_DEBUG=on GRPC_GO_LOG_VERBOSITY_LEVEL=INFO has helped debug.

Highlights from the log lines follow...

All endpoints are successfully enumerated:

{"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}

However only a single endpoint will result in a READY subconn (etcd3 in this case):

Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}

It will generally pick a different etcd[1,2,3] server each time I restart vault. Just depends on the order the SRV record is returned.

TLS CN and SANs

As per the TLS advice for SRV discovery - the SRV domain, fqdn and IP address. Ditto for etcd[2,3] servers.

Subject: CN = etcd1.nodes.example.com
X509v3 Subject Alternative Name:
    DNS:etcd1.nodes.example.com, DNS:*.etcd.example.com, DNS:etcd.example.com, IP Address:10.xx.xx.xx

Full startup logs

Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5091121,"caller":"balancer/balancer.go:39","msg":"registered balancer","policy":"etcd-client-roundrobin-balanced","name":"etcd-etcd-client-roundrobin-balanced"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5548,"caller":"balancer/balancer.go:82","msg":"built balancer","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","resolver-target":"endpoint://client-bqhf6xf64qh7/etcd3.nodes.example.com.:2379"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5549786,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550096,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550244,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.572446,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5726738,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.573283,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5733666,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com.:2379 (0xc000868b90)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5749798,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com:2379","https://etcd2.nodes.example.com:2379","https://etcd3.nodes.example.com:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5750618,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd3.nodes.example.com.:2379","subconn":"0xc000868b90"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575098,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd2.nodes.example.com.:2379","subconn":"0xc000868bb0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57512,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd1.nodes.example.com.:2379","subconn":"0xc000868bd0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751421,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575167,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751889,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57521,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"READY","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5752327,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575254,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575295,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: ==> Vault server configuration:
Dec 13 02:30:40 runc[31079]:              Api Address: https://etcd2.nodes.example.com:8200
Dec 13 02:30:40 runc[31079]:                      Cgo: disabled
Dec 13 02:30:40 runc[31079]:          Cluster Address: https://etcd2.nodes.example.com:8201
Dec 13 02:30:40 runc[31079]:               Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Dec 13 02:30:40 runc[31079]:                Log Level: (not set)
Dec 13 02:30:40 runc[31079]:                    Mlock: supported: true, enabled: true
Dec 13 02:30:40 runc[31079]:                  Storage: etcd (HA available)
Dec 13 02:30:40 runc[31079]:                  Version: Vault v0.11.5
Dec 13 02:30:40 runc[31079]:              Version Sha: a59ffa4a0f09bbf198241fe6793a96722789b639
Dec 13 02:30:40 runc[31079]: ==> Vault server started! Log data will stream in below:
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.575287,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5754151,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5860226,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dmic-16-dp68-h1-09irec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5864868,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO]  core: vault is unsealed
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO]  core: entering standby mode
jsok commented 5 years ago

The code tracing done by @gnuoy is illuminating in that the actual sequence of events is:

  1. 3 subconns are spawned
  2. They race to ClientHandshake, the winner will set the ServerName on the TLS config, e.g. etcd1.example.com
  3. The remaining subconns (which seems to share the parent clientconn and therefore TLS config) will now be forced to verify against the same ServerName

So I can only conclude that using the same authority for all subconnections in the balancer is fundamentally flawed when used with TLS.

Under the current implementation there are 3 possible solutions:

  1. Disable TLS verification - Giant 👎
  2. Add *.example.com as a SAN to each member - not great
  3. All members must list every other member as a SAN - defeats the purpose of a HA cluster since all members must have their certs rolled each time the cluster is reconfigured
alexbrand commented 5 years ago

/subscribe

xiang90 commented 5 years ago

/cc @gyuho @jpbetz

@jsok Is the TLS config in etcd client side or in the gRPC side? Can we switch to use a fresh config if the endpoint is different from the previous through balancer? Can you take look if we can fix this problem from etcd client side?

gyuho commented 5 years ago

@xiang90 @jpbetz I can reproduce this. Let me see if I can fix this in etcd client side.

dunjut commented 5 years ago

We're experiencing exactly the same issue as @gnuoy :

One certificate per server IP address and etcdv3 client fails if first endpoint unavailable.

pravinsinghal commented 5 years ago

@gyuho I am also running into the same issue - From the comments above it looks you have been working on a patch. Any update on it.

timothysc commented 5 years ago

@gyuho - It's difficult to decipher if this has been fixed or if it's still pending now as there are about a dozen cross-referenced issues and PRs.

alexbrand commented 5 years ago

I looked at the status of this last week. I believe this is still pending on https://github.com/etcd-io/etcd/pull/10489. I could be wrong though.

bahar-p commented 5 years ago

Any updates on this? We're still running into this with Kubernetes v1.14.1 and etcd 3.3.12

timothysc commented 5 years ago

We should really get an update on this for k8s - v1.15

gyuho commented 5 years ago

Just discussed with gRPC team, and got some good feedback. I will rework on this in the next few weeks.

Protopopys commented 5 years ago

Any updates on this? We're still running into this with Kubernetes v1.15.0 and etcd 3.3.13

ymmt2005 commented 5 years ago

FWIW, we work around this problem by placing TCP reverse proxies on each node that connect to etcd. Each client can connect etcd via localhost:12379. Since TLS certificates of etcd servers have "localhost" SAN and "127.0.0.1" IP-SAN, the problem can be avoided.

A possible better workaround would be to place TLS-terminating TCP reverse proxies. That is, it terminates TLS both for client connections and for etcd servers while validating server certificates with their public IP addresses.

gyuho commented 5 years ago

The fix https://github.com/etcd-io/etcd/pull/10911 has been merged, and will be released in 3.4 https://github.com/etcd-io/etcd/issues/10943.

frittentheke commented 5 years ago

@bahar-p @Protopopys ... I was just wondering the same thing. Question still is ... will Kubernetes / kubeadm or other installer tools move etcd 3.4 (when released) or will etcd do a backport for etcd 3.3.xx and the tools just got to jump the point release.

See my question at https://github.com/etcd-io/etcd/pull/10911#issuecomment-515735841

abezard commented 5 years ago

@gyuho I tried this code on my staging environment and I still face the same issue:

ETCD_CLIENT_DEBUG=1 ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints 10.10.10.242:2379,10.10.10.252:2379,10.10.10.139:2379 --cacert /etc/ssl/etcd-root-ca.pem --cert /etc/ssl/client.pem --key /etc/ssl/client-key.pem put foo bar
{"level":"debug","ts":"2019-08-07T22:54:09.070Z","caller":"balancer/balancer.go:60","msg":"registered balancer","policy":"picker-roundrobin-balanced","name":"etcd-picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:09.074Z","caller":"balancer/balancer.go:97","msg":"built balancer","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","resolver-target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:148","msg":"resolved","picker":"picker-error","balancer-id":"bw3rmsvwfygd","addresses":["10.10.10.139:2379","10.10.10.242:2379","10.10.10.252:2379"]}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.252:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.139:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.098Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.602Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.603Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"warn","ts":"2019-08-07T22:54:14.074Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.10.10.242:2379: connect: connection refused\""}

10.10.10.242 is down on purpose.

And from the same machine I can check the health of those etcd nodes just fine:

root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.252:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.252:2379 is healthy: successfully committed proposal: took = 18.506072ms
root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.139:2379 is healthy: successfully committed proposal: took = 16.068855ms

So I'm pretty confused here. I double checked to make sure I was working off a branch with your changes and I'm...

root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint status
[...]
https://10.10.10.139:2379, 146b0b16a4192d16, 3.5.0-pre, 20 kB, false, false, 48, 71, 71, 
~/go/src/github.com/etcd-io/etcd (release-3.4)$ make build
GO_BUILD_FLAGS="-v" ./build
./bin/etcd --version
etcd Version: 3.5.0-pre
Git SHA: 524278c18
Go Version: go1.12
Go OS/Arch: linux/amd64
./bin/etcdctl version
etcdctl version: 3.5.0-pre
API version: 3.5

(sorry for commenting a closed ticket)

abezard commented 5 years ago

fixed it with that: https://github.com/moondev/kubernetes/commit/45f6cb2c96705d03b3b77c16a3680b5526e14729#diff-c9fae1df26aedd520ef93a008d255581R136-R148