TritonDataCenter / rfd

Requests for Discussion

RFD 77 discussion #47

Open arekinath opened 7 years ago

arekinath commented 7 years ago

RFD 77 has been advanced to draft status. This is a place where discussion issues about it can be raised.

timkordas commented 7 years ago

In section 5, it is assumed that services will have a mechanism for refreshing their TLS certificates, and that each instance of such a service will be refreshing at least once per minute.

1) Do all of the services with these certificates have the ability to reload? What happens to requests which "span certificates" (are there any services which need to be able to span more than one certificate refresh?)

2) What's the anticipated load on the CA-signing service that's signing these? How many services will be issuing how many requests every minute?

A difficulty here is that any failure on the part of a service to get a fresh signed cert and install it (so that new clients use it) leaves the affected service-instance effectively unavailable (i.e. its certificates will not pass validation by clients, because they'll presumably have expired).

Certificate lifetimes as short as a minute seem like a good goal, but I'm concerned about how to make that work in practice. Even relatively short hiccups in the signing service's availability become quite serious.

rmustacc commented 7 years ago

Hi Alex,

Thanks for writing this up and the detail here. I think there's a lot of good stuff here already. Here are some notes from my reading of this and some open questions that I had:

Finally, I think there are some things we should talk about that are a bit more forward looking. It may not be something that we need to talk about explicitly in this RFD, but something we should probably be talking through:

arekinath commented 6 years ago

@timkordas:

> In section 5, it is assumed that services will have a mechanism for refreshing their TLS certificates, and that each instance of such a service will be refreshing at least once per minute.
>
> Do all of the services with these certificates have the ability to reload? What happens to requests which "span certificates" (are there any services which need to be able to span more than one certificate refresh?)
>
> What's the anticipated load on the CA-signing service that's signing these? How many services will be issuing how many requests every minute?

Thanks for bringing this up. What I'm thinking is to actually extend the lifetime of some of the key certificates in the chain, since I do think this load is going to be too high with 60s everywhere. A quick summary of my proposal:

| Key location | Signed identity | Signed by | Lifetime | Renew at | Positive vouching |
| --- | --- | --- | --- | --- | --- |
| Inside zone (key on disk) | Zone (with attribs) | Soft-token | 60s (customer defined) | | |
| Soft token (signing key) | Zone UUID | GZ soft-token signing key | 120s | +60s | |
| GZ soft-token (signing key) | CN UUID, intermediate | PIV signing key (9c) | 120s | +60s | |
| PIV signing key (9c) | CN UUID | CertAPI soft-token signing key | 24hr | +6hr | Yes, via CertAPI |
| CertAPI soft-token signing key | CertAPI Zone (with attribs) | Headnode PIV signing key (9c) | 24hr | +6hr | |
| Headnode PIV signing key | Headnode UUID | Self-signed | Infinite | | Yes, gossip service |

I'm using CertAPI here as a stand-in name for the headnode service that will sign these certificates.

As a general rule, we will refresh certificates at the "Renew at" offset into their lifetime, and we will never hand out a certificate that has less than half its lifetime remaining. This means that e.g. when a zone requests the entire chain from the soft-token, it may contain a certificate for the soft-token signing key that's up to 60s old (so has a minimum of 60s left on it), and a certificate for the CN that's up to 6hrs old (so has a minimum of 18hrs left on it). This rule ensures that every cert in the chain the zone receives will definitely last for at least 60s, and that there is enough time for renewals to happen before the next signing event.
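To make that concrete, here's a rough sketch of the rule (illustrative only -- the per-link numbers are just the table defaults above, and none of these names come from a real API):

```typescript
// Sketch: worst-case remaining validity of the chain a zone receives.
interface Link {
  name: string;
  lifetime: number;  // total validity period, in seconds
  renewAt: number;   // age (seconds) at which the issuer re-signs it
}

// The intermediates handed out alongside the zone's own fresh 60s cert.
const intermediates: Link[] = [
  { name: 'soft-token signing key',    lifetime: 120,       renewAt: 60 },
  { name: 'GZ soft-token signing key', lifetime: 120,       renewAt: 60 },
  { name: 'PIV signing key (9c)',      lifetime: 24 * 3600, renewAt: 6 * 3600 },
  { name: 'CertAPI signing key',       lifetime: 24 * 3600, renewAt: 6 * 3600 },
];

// Worst case, an intermediate is handed out just before its renewal, i.e. it
// is `renewAt` seconds old and still has (lifetime - renewAt) seconds left.
const worstCaseRemaining = (l: Link): number => l.lifetime - l.renewAt;

// The zone's own cert is freshly signed (a full 60s); the chain as a whole is
// only as good as its shortest remaining validity.
const chainValidFor = Math.min(60, ...intermediates.map(worstCaseRemaining));
console.log(`chain guaranteed valid for at least ${chainValidFor}s`); // -> 60
```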

It also means that CertAPI can be down for up to 18 hours before certificate chains begin to expire. We can tweak this further, of course. CertAPI will be closely coupled with CNAPI's data, but should be HA-capable from day 1, so it might be difficult to make it part of the CNAPI zone. It'll likely require moray + manatee to be up in order to operate (so 2/3 of the trusted nodes in the DC), to give a rough feel for how available it can be.

In terms of load fan-out, if we assume there are, say, 2000 zones on a machine (something we have as a goal but is not currently realistic), then this would imply a maximum signing load of ~33 sign operations per second for the GZ soft-token, and 1 sign operation per 60s for the hardware token. This seems very realistic -- the Yubikeys can do an RSA-2048 sign in about 500-600ms. This leaves plenty of room for them to spend time doing ECDH operations to unlock keys, which take about 60-70ms each.

If we assume there are 5000 CNs in a datacenter, then CertAPI will have to authenticate a request for, and then sign, roughly one certificate every 5 seconds over the network, which also seems fine. There are not many CertAPI instances, so the load generated by the headnodes signing the certs for CertAPI is fairly negligible.
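For reference, those fan-out numbers fall out of the table roughly like this (back-of-the-envelope only; the 2000-zones-per-CN and 5000-CN figures are the assumptions stated above):

```typescript
// Back-of-the-envelope sketch of the signing fan-out described above.
const zonesPerCN = 2000;             // aspirational density, per the text
const cnsPerDC = 5000;               // assumed CN count for the DC
const zoneCertLifetime = 60;         // seconds; one re-sign per zone per lifetime
const gzKeyRenewEvery = 60;          // GZ soft-token key re-signed by the PIV key every 60s
const pivCertRenewEvery = 6 * 3600;  // each CN's PIV cert renewed by CertAPI every 6hr

// GZ soft-token: one signature per zone per 60s.
const softTokenSignsPerSec = zonesPerCN / zoneCertLifetime;   // ~33.3/s

// Yubikey (PIV 9c): one RSA-2048 signature per renewal of the GZ key.
const pivSignsPerMinute = 60 / gzKeyRenewEvery;               // 1/min

// CertAPI: each CN comes back every 6 hours for a fresh PIV cert.
const certApiSignsPerSec = cnsPerDC / pivCertRenewEvery;      // ~0.23/s, one per ~4.3s

console.log({ softTokenSignsPerSec, pivSignsPerMinute, certApiSignsPerSec });
```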

I've also noted on that table where positive vouching is possible. Positive vouching will always be used for the headnode root CA certs (via the gossip service). Positive vouching can optionally also be used for the long-duration certs in the chain by connecting to CertAPI (or cloudapi from outside on a customer network) and confirming the status of the given CN. I expect this capability could be used for authenticating particularly sensitive or high-impact operations in APIs, for example.

If that seems sensible, I'll incorporate it into the writeup (and clarify the structure of the cert chain more in that section too). I've been talking to ops about this and they seem to want a minimum of ~18hr to react to an outage in a core service before customer workloads start dropping, though they'd like to see this be tweakable. So I guess take the values in the table as defaults.

[this comment took several edits to get right]

timkordas commented 6 years ago

What you write above seems sensible to me.

The only related issue I can think of at the moment is that it would be nice to have the longer-lived certificate expirations "smeared"/jittered -- such that they don't all expire at 0:00 UTC or whatever. If the initial generation is done all at once, most certs will tend to expire around the same time; it'd be nice to spread the +6hr renewal load around the clock relatively evenly. Perhaps this means generating the initial 24-hour certificate with an expiry of something like 24 + (Math.random() * 24) hours -- or something similar.

That is, you want to get into a state where you are doing some steady number of cert renewals every minute, not a situation where you only hit that number on average (all the work landing in minute zero at the top of the hour, then idle for the rest of the hour).
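Concretely, the smearing might look something like this (a sketch of the idea only, not anything specified in the RFD; it assumes renewals are scheduled relative to each cert's own issue time):

```typescript
// Sketch of the jitter idea: randomize the *first* certificate's lifetime so
// that subsequent renewals, scheduled off each cert's own issue time, end up
// spread around the clock rather than clustered at whatever moment initial
// setup happened.
const HOUR = 3600 * 1000; // ms

function firstCertExpiry(now: number = Date.now()): number {
  // 24h base lifetime plus up to 24h of one-off jitter, per the suggestion above.
  return now + 24 * HOUR + Math.random() * 24 * HOUR;
}

function renewedCertExpiry(issuedAt: number): number {
  // All later certs use the normal fixed 24h lifetime.
  return issuedAt + 24 * HOUR;
}
```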

mgerdts commented 5 years ago

We need some details around CN setup. Presumably that is something like:

  1. PXE boot
  2. sdc-server setup zfs_encryption=true <mumble>

Presumably we want to be able to indicate which keys are allowed for recovery. I'd expect that we don't want to set that up for each CN - maybe there's a DC-wide set of keys that are used and all CNs should converge on that. If it doesn't apply to all CNs, we probably need some way to attach a profile that includes the recovery public keys and their names. Key rotation should take advantage of centralized config.
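For illustration only (none of these names exist anywhere yet), a DC-wide recovery profile might be shaped something like:

```typescript
// Hypothetical shape for a DC-wide recovery-key profile; every name here is
// invented for illustration and is not defined by the RFD or any SDC API.
interface RecoveryProfile {
  name: string;                     // label for the profile, e.g. "dc-default"
  recoveryKeys: Array<{
    name: string;                   // which operator/escrow key this is
    publicKey: string;              // the recovery public key itself
  }>;
  appliesTo: 'all-cns' | string[];  // whole DC, or an explicit list of CN UUIDs
}

const dcDefault: RecoveryProfile = {
  name: 'dc-default',
  recoveryKeys: [
    { name: 'ops-escrow-1', publicKey: '<public key material>' },
    { name: 'ops-escrow-2', publicKey: '<public key material>' },
  ],
  appliesTo: 'all-cns',
};
```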