etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

ETCD doesn't automatically load changes to ca bundles for peer-trusted-ca-file or trusted-ca-file #11555

Open relyt0925 opened 4 years ago

relyt0925 commented 4 years ago

etcd cannot handle cert bundles in the peer-trusted-ca-file or trusted-ca-file options. Without the ability to handle CA bundles, it is impossible to rotate a CA with zero downtime without re-signing all active client and server certs at once.

If a CA bundle were allowed, a new CA could be created and made valid in all components in the first iteration. Client certs could then be re-signed with the new CA, since the server components would have the new CA plus the old CA in their trust bundle. Once all clients have been re-signed and have downloaded the old + new CA bundle, the server certs can be re-signed with the new CA, and the old CA can then be removed.
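The trust relationships behind this rotation plan can be checked with a small Go sketch. It is only an illustration under stated assumptions: the helpers newCert and trusted are hypothetical names, the CAs and client certs are generated in memory, and nothing here touches etcd. It shows that a pool holding both CAs accepts client certs signed by either one, while a pool holding only the new CA rejects certs signed by the old one.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCert is a hypothetical helper. With a nil parent it creates a
// self-signed CA; otherwise it creates a client cert signed by parent.
func newCert(name string, serial int64, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: name},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	signer, signerKey := tmpl, key // self-signed unless a parent is given
	if parent == nil {
		tmpl.IsCA = true
		tmpl.BasicConstraintsValid = true
		tmpl.KeyUsage = x509.KeyUsageCertSign
	} else {
		tmpl.KeyUsage = x509.KeyUsageDigitalSignature
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth}
		signer, signerKey = parent, parentKey
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signer, &key.PublicKey, signerKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// trusted reports whether leaf chains to one of roots for client auth.
func trusted(leaf *x509.Certificate, roots ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, r := range roots {
		pool.AddCert(r)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err == nil
}

func main() {
	mk := func(name string, serial int64, p *x509.Certificate, pk *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey) {
		c, k, err := newCert(name, serial, p, pk)
		if err != nil {
			panic(err)
		}
		return c, k
	}
	oldCA, oldKey := mk("old-ca", 1, nil, nil)
	newCA, newKey := mk("new-ca", 2, nil, nil)
	oldClient, _ := mk("etcd-client", 3, oldCA, oldKey)
	newClient, _ := mk("etcd-client", 4, newCA, newKey)

	// During rotation the bundle holds both CAs; afterwards only the new one.
	fmt.Println("old+new bundle trusts old-signed client:", trusted(oldClient, oldCA, newCA))
	fmt.Println("old+new bundle trusts new-signed client:", trusted(newClient, oldCA, newCA))
	fmt.Println("new-only bundle trusts old-signed client:", trusted(oldClient, newCA))
	fmt.Println("new-only bundle trusts new-signed client:", trusted(newClient, newCA))
}
```

The point of the sketch is the ordering: the old CA can only be dropped from the bundle once no client still presents a cert signed by it.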

It appears this was meant to be fixed but I am able to replicate the issue in an etcd deployment today. I will expose all the certs and command line configurations in this issue so the exact steps can be replicated.

relyt0925 commented 4 years ago
      /usr/local/bin/etcd
      --data-dir=/var/etcd/data
      --name=etcd-boka410001nkm6q6gomg-qbjxcbfm8s
      --initial-advertise-peer-urls=https://etcd-boka410001nkm6q6gomg-qbjxcbfm8s.etcd-boka410001nkm6q6gomg.master-boka410001nkm6q6gomg.svc:2380
      --listen-peer-urls=https://0.0.0.0:2380
      --listen-client-urls=https://0.0.0.0:2379
      --advertise-client-urls=https://etcd-boka410001nkm6q6gomg-qbjxcbfm8s.etcd-boka410001nkm6q6gomg.master-boka410001nkm6q6gomg.svc:2379
      --initial-cluster=etcd-boka410001nkm6q6gomg-qbjxcbfm8s=https://etcd-boka410001nkm6q6gomg-qbjxcbfm8s.etcd-boka410001nkm6q6gomg.master-boka410001nkm6q6gomg.svc:2380,etcd-boka410001nkm6q6gomg-wjcb49jdt9=https://etcd-boka410001nkm6q6gomg-wjcb49jdt9.etcd-boka410001nkm6q6gomg.master-boka410001nkm6q6gomg.svc:2380,etcd-boka410001nkm6q6gomg-xq2d55qxxc=https://etcd-boka410001nkm6q6gomg-xq2d55qxxc.etcd-boka410001nkm6q6gomg.master-boka410001nkm6q6gomg.svc:2380
      --initial-cluster-state=existing
      --strict-reconfig-check=true
      --listen-metrics-urls=http://0.0.0.0:2381
      --peer-client-cert-auth=true
      --peer-trusted-ca-file=/etc/etcdtls/member/peer-tls/peer-ca.crt
      --peer-cert-file=/etc/etcdtls/member/peer-tls/peer.crt
      --peer-key-file=/etc/etcdtls/member/peer-tls/peer.key
      --client-cert-auth=true
      --trusted-ca-file=/etc/etcdtls/member/server-tls/server-ca.crt
      --cert-file=/etc/etcdtls/member/server-tls/server.crt
      --key-file=/etc/etcdtls/member/server-tls/server.key

server-ca.crt contains two CAs:

cat /etc/etcdtls/member/server-tls/server-ca.crt
-----BEGIN CERTIFICATE-----
MIIC9DCCAdygAwIBAgIUEf3wlI7MXrgAAz3Ph7qyRTQvdkowDQYJKoZIhvcNAQEL
BQAwEjEQMA4GA1UEAxMHcm9vdC1jYTAeFw0yMDAxMjIxOTM1MDBaFw0yNTAxMjAx
OTM1MDBaMBIxEDAOBgNVBAMTB3Jvb3QtY2EwggEiMA0GCSqGSIb3DQEBAQUAA4IB
DwAwggEKAoIBAQCztJ0qkC9zM/FgJNO6DoainaBdebEmHzRjDd7oj4b2EKt4itGJ
d62Ix04LJaLu4ojJZO6Ez7MXlxMwBSa7nd48kjHD7/xOOBlzEJPhNeva0N7rSJBO
8AwBwvkPtWFj+qaSfDMby0YMAlJu/oJAFkM8KmwJ8X7yQ5SUHXQkBN+uR3BiZEQg
vClg1EYBJjzAx2prKFM1lDQ7jyDJ7ysJUW5I8YrIJ+gOYidIf9mfCFNMbhsyFvvK
RtRRMgL9Zs3/c3ioMURTPX/21kt46c2XKDTYrOU6FV4UOxE+i1P/Fqp3NjT16wJt
qKGWlM3FUrDDoXjhb/M4y6ACPdO5Wj153p7DAgMBAAGjQjBAMA4GA1UdDwEB/wQE
AwIBBjAPBgNVHRMBAf8EBTADAQH/MB0GA1UdDgQWBBRJC2HyqoPIqV4ClFJABdtE
IWD+hjANBgkqhkiG9w0BAQsFAAOCAQEACos71kORZLESMz02SkjU6+eofFs6QqIw
GvdIorN8oNNC8W5BTuZVFD33MM2ztLNH7yIEYn1Me58GQpWSHPe3x9Cr20Wn24Rq
DHB3hfErN9owaTJuhMxPT2NURPfgJ01mvBUHT8JEmd3PH9dZkDR/cKN+YuLM8fGX
VY/EpprDKHhfyuzVnY04BBpsZeGPZVeVnomoaeewPWszlewYsfNWYTrOFcl5BYPX
Z0eEC0Wu2XYqlykjSQzQEs5TL29i5LK2mNkRol8pG0RLNOWG96XuSoX9W0xaLBZA
7ySyAkPE0PY9DVenffWIz2LP8SMi+ip8i9mHQN/D0iMW6Yx40ofHcw==
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIID4jCCAsqgAwIBAgIUPSVYs0fx95fiLIUhMAWf25wbrGYwDQYJKoZIhvcNAQEL
BQAwgYgxCzAJBgNVBAYTAlVTMQ4wDAYDVQQIEwVUZXhhczEPMA0GA1UEBxMGQXVz
dGluMRMwEQYDVQQKEwpLdWJlcm5ldGVzMQswCQYDVQQLEwJDQTE2MDQGA1UEAxMt
Ym9qdThocjAwb2dkMnNndXF2dmcta3ViZXJuZXRlcy1jYS0xNTc5NzI5ODc1MB4X
DTIwMDEyMjIxNDcwMFoXDTI1MDEyMDIxNDcwMFowgYgxCzAJBgNVBAYTAlVTMQ4w
DAYDVQQIEwVUZXhhczEPMA0GA1UEBxMGQXVzdGluMRMwEQYDVQQKEwpLdWJlcm5l
dGVzMQswCQYDVQQLEwJDQTE2MDQGA1UEAxMtYm9qdThocjAwb2dkMnNndXF2dmct
a3ViZXJuZXRlcy1jYS0xNTc5NzI5ODc1MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A
MIIBCgKCAQEAr8QRne0kjzvBFR0Thbvp1uSMxGCN0QbCeYBbDkb3VjXaQ2h5iQ8m
W0DTTB5JmCQ3rJgMcFTWnOT6CZQ6fQaffN5zgZ65YyhPnQMMh9yxY31JhedEP40z
DFZY/hvF4kz6rXopHqNODHki3E2Qiz1IiTjZj9x1O6wfRknw4V5GoBTgwFCMhOo+
XXzr6cvgczetnVL00t5h4mXkhSzie5fDcXv6zq6SYtQaODKtK5rqZLMdhd2M1vFz
R8CmWBq75kt/YSJ63ClvCtq1akmOn/uM0kvFlTF0qEHXPUNbp7w7m6F4NsQMFeE/
Kt1uRSt3GCQuCZkqmaIGJj//c7TsnvJ1pwIDAQABo0IwQDAOBgNVHQ8BAf8EBAMC
AQYwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUwRPlQBZletqINYW7nxT9+UVI
S1swDQYJKoZIhvcNAQELBQADggEBACYb+zzBBWL6I8FcsF9NYTC1NaXohsPGPiaS
YUQAw9fAOpNf/wBBilwrq6kXPSzO1gKW8FJBPebQZOw+uOEGVaVqhTWVBBU7+SLf
1JpU2f1mpWlo/lLDukwNC3IXdhuG2uWXeB4xnADmbx0uQszZjivd+ZGBthPqEcNA
GLvyxEnapilKjMNtXpO/tv9UQuhb2LrZ5isF/6EzDtUb2k7bTT/5tYyrD3vWJB66
fx8ZNvCdq635XqxVFg0+tIhh/CH6A0w6BkRgBtvZsX6f+y8+nKtqoviKAxYnTkGB
DJuzqoHDLq8iwVzMzl34beASjzj5o8P7z0qNOGqc79fpWdzxhS0=
-----END CERTIFICATE-----
relyt0925 commented 4 years ago

Since server-ca.crt has two CAs, clients should be able to connect with certs signed by the second CA. Here I paste an example of a cert signed by the second CA. NOTE: These are throwaway credentials that are solely meant for the etcd team to be able to replicate the problem.

# cat mycert
-----BEGIN CERTIFICATE-----
MIID4zCCAsugAwIBAgIUYXEqYxlz/l/5+wJI4E4o1s/0+EUwDQYJKoZIhvcNAQEL
BQAwgYgxCzAJBgNVBAYTAlVTMQ4wDAYDVQQIEwVUZXhhczEPMA0GA1UEBxMGQXVz
dGluMRMwEQYDVQQKEwpLdWJlcm5ldGVzMQswCQYDVQQLEwJDQTE2MDQGA1UEAxMt
Ym9qdThocjAwb2dkMnNndXF2dmcta3ViZXJuZXRlcy1jYS0xNTc5NzI5ODc1MB4X
DTIwMDEyMjIxNDcwMFoXDTIyMDQyNjIxNDcwMFowSDELMAkGA1UEBhMCVVMxFjAU
BgNVBAgTDVNhbiBGcmFuY2lzY28xCzAJBgNVBAcTAkNBMRQwEgYDVQQDEwtldGNk
LWNsaWVudDCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAKuPZXWsco/3
dGjsKZa7IvU48XLuif63BUUZABdNtqId8QspdyeN1oN0WznO0tWBfh+T+L3m6Bme
Fs/dcI/bSpVC9yr3cfHACxTFWwvY5RMYggV3w2X+gutldaMeDfQQZ2secCi/4zHs
6OM4da166BHRon3Zo0furIdRszLDVRyy8M5csFbtU8kno+MlXvZzhB1O9IFqN7FA
nZ9vfbY0UWg+TVotGSTr/FEsfxc6Rt5ZP4qDBIUSQNBno/9V3lFgxcJ1WzlVi+AH
bZWd9BNQYzyifxz0lUKgXuORUGe7lHfaLEUEDlS8UCV+JlcJNr1lJzSB3H/ZmnGu
Liq91IIo1SMCAwEAAaOBgzCBgDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYI
KwYBBQUHAwIwDAYDVR0TAQH/BAIwADAdBgNVHQ4EFgQUYuwSVcbeYgYLhTIey3eR
J91hiw8wHwYDVR0jBBgwFoAUwRPlQBZletqINYW7nxT9+UVIS1swCwYDVR0RBAQw
AoIAMA0GCSqGSIb3DQEBCwUAA4IBAQCKjyO5OBwXBrLPa4hYuo7PE/a0G4A+bzPG
SgxvcJwAJUC0Sy6msCrzbEaiOvMuuEjCcIlXQ6exX7YkTxOyHd6bqG4rDjUssE8V
CIExwcpBVOA9rz+ABcym0NklIg5IO6ivR6XwjccpRGwuEIaprkE3l8iNbuoZH0Lg
bDtazL7SYwUM61Xzqlw0wPxOLZKZXOtsLV7xYV7uulcK1kZVfkkFDtAqnDiY4mbu
Xk9dO2l8Eu4/9q7kimgnMQGJRzADNIjhPRGL7lyt9SjhldxZGmsGWS3z650CNt5F
smvmmytPffNoX4+QfLWW380GBHYKUwHiIfWM2jGyKRliatCkGuDY
-----END CERTIFICATE-----
# cat mykey
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEAq49ldaxyj/d0aOwplrsi9Tjxcu6J/rcFRRkAF022oh3xCyl3
J43Wg3RbOc7S1YF+H5P4veboGZ4Wz91wj9tKlUL3Kvdx8cALFMVbC9jlExiCBXfD
Zf6C62V1ox4N9BBnax5wKL/jMezo4zh1rXroEdGifdmjR+6sh1GzMsNVHLLwzlyw
Vu1TySej4yVe9nOEHU70gWo3sUCdn299tjRRaD5NWi0ZJOv8USx/FzpG3lk/ioME
hRJA0Gej/1XeUWDFwnVbOVWL4AdtlZ30E1BjPKJ/HPSVQqBe45FQZ7uUd9osRQQO
VLxQJX4mVwk2vWUnNIHcf9maca4uKr3UgijVIwIDAQABAoIBACflCKr8lwdze9aK
VIGAsvhjbYJUhjJ9TPRsg+DnaXj5jXwTAGpqSV/4Rt6CgfS0UCf3uPgwIfkTEir0
S6CoVgevstqDADQ/fFQwMKPopBx0roem+gFV1gv3ZAuyiXwf9Gysf1h6htKtTNrJ
3lOhKSY7oZWybo3jqqstDIbZdYlekbiW4VuH2H2fL8qlHDsRG7mU8xXgo4QaKmel
6EJq24hxMrMPaUAWNQfa/qU9HtmHcKzsV3QOtWl1XUEtX/EdFjKQs8PyfxefQpu/
1MeK43g6Uq03teIGS+1ftdVjHxlRLMcG6+knjblrt0pd9m0ai9iZzsVkHnwvgZNg
6LQNpQECgYEA0E7BReWJxOsKl1nKLRKLq07Iw1UpalwramDZRL1blTvgvVZWPNJ/
z3WbhByigYdyNZCA2sPKeFGNgrG9gmhLqSHEBRKs1MSgvTQCnvaL2ZTZYQ2vJd0n
7ynCJbGYo7cXfXo4iaNu4OAyzwmIwhOfScPuXqNAtOnI9VGiCXh4HrMCgYEA0tbP
yuO6kiKngdKDQo0BIZjk2pUk07P/DVFlLJp1DqEQpIEMiqAa6/uu+Rf1otNevMRw
jXXsG/HWY5QaJ4bDjYP0BMVlEce1RASSlrOvhfmHFiOF4xI3kFG8TJ/yfLSpUP1C
zR+1EBxq6LzRNOn94wfuWGHbWebNvhuWJcwrp9ECgYAc23ws6bKfRAxwkTDP86zD
q6NmZArbwC8Hiqkuu6jPUL8+m5JQ1Lx+CgXkVG8y0IfC4eTn6Y3IA0w+Wc8uHLK2
mIXmSgMFasP10hm22eLf3p4KsvGbpjqdCETsIeFKdNfdOyxP7QM0Rfrj8acvc7Zy
aqFAHQ+ewHBlg8yV0UmavwKBgEUJJ2LkrEt7Y2PD3UzmRK+Ok6jq2vMi5emjdEBl
ltyiaoOi6cteX1JTx9gyOzEEium+XKhFK3l+91cFwIaevttQkI8bX1uyC61o3eLQ
lTGGIfBi000lwuHTkZd5a/nfYe1t7/igYDYVSABLCymLUKGNEEMKT7uhMk8EU2au
8sBxAoGBALRjdHWOktFcjp2jMlnhRWOKTlXCqS97G3m4154LTcjE6DnUL4FSNrzu
KHJhGttAWFOZ/d2DXS9SOwyfGXW/5qqXfGxg8p/jVCBO07wKJeuZG8B/S7/Jtvr9
wq1/OAHNzVxt89xk+ECVWE0mrezgnu9BEdeuWTzcmHwFlLHKrc0M
-----END RSA PRIVATE KEY-----
relyt0925 commented 4 years ago

This cert/key pair is signed by the second CA, so you should be able to access the cluster using:

#  ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 --cert=mycert --key=mykey --cacert=/etc/etcdtls/operator/etcd-tls/etcd-client-ca.crt get --consistency=s foo

{"level":"warn","ts":"2020-01-22T22:53:02.307Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-278c8d86-eb20-4759-b4e2-bb3d4b1fb0cc/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: authentication handshake failed: remote error: tls: bad certificate\""}

But as you can see the server does not accept the request because the second cert in the trust bundle is not recognized as a valid CA.

Curious if this is expected or a known issue?

relyt0925 commented 4 years ago

Note for anyone experiencing this issue: etcd can handle CA bundles; however, the etcd instances have to be explicitly restarted in order to pick up the new cert bundles.

invidian commented 3 years ago

@relyt0925 why has this been closed? I'm experimenting with renewing CA certificates in etcd, and it is surprising to me that new peer/server certs are picked up dynamically, but CA certificates are not.

relyt0925 commented 3 years ago

I can reopen as this hasn't been solved. I worked around it by triggering restarts whenever CAs were updated!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

invidian commented 3 years ago

Please not stale bot

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

invidian commented 3 years ago

Not stale.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

invidian commented 3 years ago

Not stale

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

sfotony commented 2 years ago

Still not stale

astromechza commented 2 years ago

I've been looking at solving this in a system I run today. Currently, the next best thing we have is to have a process monitor the CA bundle on disk and then coordinate on gracefully restarting the etcd members - coordinating via etcd itself. This is obviously risky.

It usually looks something like:

  1. wait for file to change
  2. wait for cluster to be healthy and all expected members to be healthy
  3. claim ownership of a key in etcd with a lease of 5 minutes
  4. gracefully close the member
astromechza commented 2 years ago

Naturally, this would be far, far better if it just happened automagically.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

sfotony commented 2 years ago

This is not stale

aauren commented 2 years ago

@gyuho @xiang90 (or one of the other maintainers) - Is there a way to permanently disable the stale bot for this issue?

This is a very real issue that recently hit us in our move to cycle our root certs in our environments. Working around this issue requires a lot of orchestration complexity and is time intensive.

Additionally, it violates the principle of least surprise given that for most other certificate changes, etcd will transparently load them from disk during the next client connection (https://github.com/etcd-io/etcd/pull/7784). However, in this rare case (CA bundles) etcd does not appear to do this and has the potential to cause downtime during CA cutovers.

serathius commented 2 years ago

@aauren Contributions are welcomed.

aauren commented 2 years ago

Definitely willing to take a look, but it would be helpful if the stale bot wasn't constantly trying to close this issue so that it can be properly tracked.

It's already happened once before: https://github.com/etcd-io/etcd/issues/10400

serathius commented 2 years ago

I have started the discussion about the bot in https://github.com/etcd-io/etcd/issues/13775; I need to write a proposal for the issue triage process. Maintainers would mark an issue as accepted so the bot doesn't close it.

The problem is that this issue has not been looked at by any contributor/maintainer, meaning that the issue is not critical enough for someone to be willing to spend time fixing it.

aauren commented 2 years ago

Ok... After a bit of poking around I think I see why CA certificates aren't reloaded on new connections the same way that certs and client certs are.

The config object in crypto/tls allows GetCertificate and GetClientCertificate to be function-based callbacks: https://github.com/golang/go/blob/master/src/crypto/tls/common.go#L557

etcd implements those and uses them to get a fresh copy of the cert and key file from the filesystem each time a new client connection is initiated: https://github.com/etcd-io/etcd/blob/main/client/pkg/transport/listener.go#L408

However, the config object does not expose a similar function-based callback for loading CA certificates. For these it only exposes a single attribute: https://github.com/golang/go/blob/master/src/crypto/tls/common.go#L638 that is set up when the config is created.

The only way that I can see to work around this without changing the flow completely would be to implement the GetConfigForClient callback (https://github.com/golang/go/blob/master/src/crypto/tls/common.go#L587), which would allow us to re-read the CAs from disk at the same time that we get the certs / client certs and re-initialize the CA certificate pool: https://github.com/etcd-io/etcd/blob/main/client/pkg/transport/listener.go#L486

@serathius would a change like this be acceptable to the project?

Never mind, I see that https://github.com/etcd-io/etcd/pull/13307 does exactly that and is already in the process of being reviewed.

I'm not sure how I missed that, or why it wasn't pointed out as an almost complete option instead of asking for a contribution. Anyway, I'll monitor the progress of that PR.

ptabor commented 2 years ago

FTR: We had a related discussion in https://github.com/etcd-io/etcd/pull/13902.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

invidian commented 1 year ago

Not stale.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

aauren commented 1 year ago

Not stale

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

mrueg commented 1 year ago

not stale, would love to see https://github.com/etcd-io/etcd/pull/13307 or a similar version merged.

oblazek commented 3 months ago

hey! yeah this is a must have. https://github.com/etcd-io/etcd/pull/16500 looks promising, but needs CR.