etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

v2 proxy fails after upgrading cluster from v2 to v3 #7442

Closed raoofm closed 7 years ago

raoofm commented 7 years ago

As discussed in #7344, I agree that the v2 proxy is compatible, but it fails if SSL is enabled.

Our scenario: a v2.3.8 cluster with SSL enabled. After upgrading the cluster to v3.0.17, the v2.3.8 proxy fails with a certificate error:

```
2017-03-07 21:17:45.736054 W | etcdserver: could not get cluster response from https://node01.example.com:2380: Get https://node01.example.com:2380/members: x509: certificate signed by unknown authority
```

Existing config, which now fails:

```
/var/lib/etcdProxy/etcd -data-dir /var/lib/etcdProxy/proxy/datadir -listen-client-urls http://localhost:2377 -discovery-srv example.com -proxy on
```

Modification that works (adding `-peer-trusted-ca-file`):

```
/var/lib/etcdProxy/etcd -data-dir /var/lib/etcdProxy/proxy/datadir -listen-client-urls http://localhost:2377 -peer-trusted-ca-file /var/home/rm/cfssl/ca.pem -discovery-srv example.com -proxy on
```
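For anyone hitting the same `x509: certificate signed by unknown authority` error: it means the proxy cannot chain the member's cert to any CA it trusts. The chain can be checked offline with `openssl verify`. A sketch with throwaway generated certs (in a real setup, the existing CA and member certs, e.g. `/var/home/rm/cfssl/ca.pem`, would be used instead):

```shell
# Generate a throwaway CA, then a member cert signed by it, then verify
# the chain -- the same check the proxy's TLS handshake performs.
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca-key.pem -out ca.pem \
  -subj "/CN=example-ca" -days 1
openssl req -newkey rsa:2048 -nodes -keyout member-key.pem -out member.csr \
  -subj "/CN=node01.example.com"
openssl x509 -req -in member.csr -CA ca.pem -CAkey ca-key.pem \
  -CAcreateserial -out member.pem -days 1

# With the right -CAfile this prints "member.pem: OK"; with the wrong CA
# (or none), it fails the same way the proxy does.
openssl verify -CAfile ca.pem member.pem
```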

raoofm commented 7 years ago

One of the members was started as:

```
/var/lib/etcd/etcd -discovery-srv example.com -advertise-client-urls https://node02.example.com:2379 -initial-cluster-state new -name node02 -data-dir /var/lib/etcd/cluster/datadir -listen-peer-urls https://node02.example.com:2380 -initial-advertise-peer-urls https://node02.example.com:2380 -listen-client-urls https://node02.example.com:2379 -trusted-ca-file /var/home/rm/cfssl/ca.pem -cert-file /var/home/rm/cfssl/member2.pem -key-file /var/home/rm/cfssl/member2-key.pem -peer-trusted-ca-file /var/home/rm/cfssl/ca.pem -peer-cert-file /var/home/rm/cfssl/member2.pem -peer-key-file /var/home/rm/cfssl/member2-key.pem -heartbeat-interval 200 -election-timeout 2000 -initial-cluster-token etcd-cluster-dev >>/var/log/etcd/etcd.out 2>&1 &
```

gyuho commented 7 years ago

> proxy is compatible but it fails if ssl is enabled.

Do you mean etcd fails, or the v2 proxy fails?

> x509: certificate signed by unknown authority

Are these certs valid? Have you tried setting up simple single node to see if it works?
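A minimal single-node check might look like the following sketch (paths and certs are hypothetical; the flags are the same ones used elsewhere in this thread, and running it requires an etcd binary plus pre-generated certs):

```shell
# Start one TLS-enabled member (member.pem/member-key.pem signed by ca.pem):
etcd -name node01 -data-dir /tmp/etcd-member \
  -initial-cluster node01=https://127.0.0.1:2380 \
  -listen-peer-urls https://127.0.0.1:2380 \
  -initial-advertise-peer-urls https://127.0.0.1:2380 \
  -listen-client-urls https://127.0.0.1:2379 \
  -advertise-client-urls https://127.0.0.1:2379 \
  -cert-file member.pem -key-file member-key.pem \
  -peer-cert-file member.pem -peer-key-file member-key.pem \
  -peer-trusted-ca-file ca.pem &

# Point a v2 proxy at it. Without -peer-trusted-ca-file (and with ca.pem
# absent from the system bundle), this should reproduce the
# "certificate signed by unknown authority" error from the report above.
etcd -proxy on -data-dir /tmp/etcd-proxy \
  -listen-client-urls http://127.0.0.1:2377 \
  -initial-cluster node01=https://127.0.0.1:2380
```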

heyitsanthony commented 7 years ago

There was some TLS tightening in 3.0 (e.g., all the SRV fixes) and a TLS version upgrade to 1.2. It's possible that 2.3.x was too permissive about certs, and 3.0.17 is now rejecting them like it should have in the first place.

raoofm commented 7 years ago

++ @xiang90 @philips @gyuho I mean the v2 proxy fails to connect to v3 etcd. The certs are valid, and all 3 members started fine. I listed the flags used to start v3 etcd above, along with the old etcd proxy flags that fail and the new flags (with certs) that work for the proxy.

@heyitsanthony I totally understand that, but I think this needs to be loosened, at least for clients coming via a v2 proxy on a v3 cluster, as it breaks existing clients. We use this in production; our setup is described at https://github.com/coreos/etcd/blob/master/Documentation/production-users.md#vonage

We have a proxy running alongside each app on the same machine. The impact is huge; we can't easily track down and upgrade (or restart with new flags) every proxy.

This is a blocker for upgrading etcd from v2 to v3. We really want to use v3 with Vault.

I would suggest introducing a flag to enable the old behavior, off by default. I agree security is important, but as it stands there is no smooth upgrade path: tons of clients would break, and we shouldn't break existing clients.

Even https://github.com/coreos/etcd/blob/master/Documentation/v2/proxy.md doesn't mention setting a trusted CA.

Thoughts?

heyitsanthony commented 7 years ago

@raoofm OK. Just to confirm:

raoofm commented 7 years ago

@heyitsanthony yes

heyitsanthony commented 7 years ago

@raoofm, @gyuho and I were able to reproduce this on our side. A fix that appears to work out of the box is to append the etcd ca cert to the system certs file (usually /etc/ssl/certs/ca-certificates.crt). If this isn't an option, we can investigate a patch for 3.0.17.
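A sketch of the workaround described above (the bundle path is the Debian/Ubuntu default; other distros differ, and `/var/home/rm/cfssl/ca.pem` is the CA path from this thread):

```shell
# Option 1: append the etcd CA directly to the system bundle, as suggested.
sudo sh -c 'cat /var/home/rm/cfssl/ca.pem >> /etc/ssl/certs/ca-certificates.crt'

# Option 2 (preferred on Debian/Ubuntu): install it as a local trusted CA
# so the bundle is regenerated for you and survives package updates.
sudo cp /var/home/rm/cfssl/ca.pem /usr/local/share/ca-certificates/etcd-ca.crt
sudo update-ca-certificates

# Either way, the proxy must be restarted to pick up the new trust store.
```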

raoofm commented 7 years ago

@heyitsanthony @gyuho Appending to the system certs worked, but it amounts to the same thing as adding the flag and restarting etcdProxy, because even after appending the cert to the CA bundle, the proxy still needed a restart.

This means every proxy needs to be reconfigured and restarted before the upgrade, which shouldn't be necessary.

raoofm commented 7 years ago

The one hope here is that the existing proxy machines already have this CA as part of their CA bundle. We have roughly 150 proxies deployed, so verifying that is an action item for me. Can you think of other options?

heyitsanthony commented 7 years ago

OK, it wasn't clear that the proxies couldn't be restarted at all.

> This means every proxy needs to be reconfigured and restarted before the upgrade, which shouldn't be necessary.

Except the TLS configuration was wrong from the start. The proxy shouldn't have been able to connect to etcd without checking the authenticity of the certs. At this point it's about figuring out how to work around a broken configuration that should never have worked in the first place.

> Can you guys think about other options?

We can investigate a server-side patch so the authentication check won't happen.

raoofm commented 7 years ago

> Except the TLS configuration was wrong from the start. The proxy shouldn't have been able to connect to etcd without checking the authenticity of the certs. At this point it's about figuring out how to work around a broken configuration that should never have worked in the first place.

Agreed.

heyitsanthony commented 7 years ago

@raoofm It appears that what looked like a repro either wasn't one, or it worked once and was then lost; a 2.3.8 proxy against a 2.3.8 etcd always gives a cert error when the proxy does not have access to the etcd CA. Is there a reliable way to reproduce this from scratch?

raoofm commented 7 years ago

OK, I'll try and write up the steps.

raoofm commented 7 years ago

@heyitsanthony @gyuho Luckily for us, on QA, preprod, and prod the CA signing the certs is already part of the default CA bundle, and the upgrade was a success. (It isn't on dev, which is why we hit this issue while testing the upgrade; sorry.)

We are currently upgrading to 3.1.3; we'll test how it plays with Vault and keep you posted.

Thanks much for your support guys. This issue can be closed.