Open arekinath opened 7 years ago
In section 5, it is assumed that services will have a mechanism for refreshing their TLS certificates, and each instance of such services will be refreshing at least once per minute.
1) Do all of the services with these certificates have the ability to reload? What happens to requests which "span certificates" (are there any services which need to be able to span more than one certificate refresh?)
2) What's the anticipated load on the CA-signing service that's signing these? How many services will be issuing how many requests every minute?
A difficulty here is that any failure on the part of a service to get a fresh signed cert and install it (so that new clients use it) effectively makes the affected service-instance unavailable (i.e. its certificates will not pass validation by clients, because they'll presumably have expired).
Certificates with valid lifetimes as short as a minute seem like a good goal, but I'm concerned about how to make them work in practice. Even relatively short hiccups in the signing service's availability become quite serious.
Hi Alex,
Thanks for writing this up and the detail here. I think there's a lot of good stuff here already. Here are some notes from my reading of this and some open questions that I had:
In section 4.2 we talk about the fact that the zones have enough load that we want to scale things, and therefore we should use soft tokens. However, I think a missing piece of this is how the agents fit into the equation. The agents can often be just as busy as some of the services that we have, and are often both making requests and listening for things themselves. At the moment the section says agents should use the hardware directly, but it may be that we need to consider soft tokens for all of them, just like the zones, and potentially more than one per agent.
I think another important thing to clarify in section 4.2 is when we're going to be using mutual TLS auth (both client and server certs) versus when we're not going to be using TLS client certs and instead relying on the HTTP signature algorithm.
In section 4.3 we may want to warn about the risks of running in configurations with fewer than three different trusted nodes, given that exploiting one of those would allow someone to control the gossip.
In section 4.4 we talk about the trusted head node being the root of the X.509 chain? Was this meant to be a 'trusted CN', or was 'trusted head node' the right level? Similarly, I presume that it's the list of trusted CN CAs that they need to keep up to date, given that's what's participating in gossip.
Can there be a future RFD that goes into the gossip protocol in more detail?
I think we may need to have a broader talk about what it means to be a trusted CN or not. For example, take all the agents. They're running on every CN and we want the same ability to construct roots of trust for something like ur, cn-agent, or cmon-agent. There are also some of the offerings in Triton today like Manta that rely on being able to talk to SAPI via config-agent. As such, it feels like it's not just trusted nodes that we're going to need to apply a lot of the verification to. It may be that it doesn't make sense for these to be in the gossip and that the way registrar verifies them is different, but it does seem like something to talk through a bit more.
In section 4.5, can we talk through a bit more of the reasons why we may or may not need transport-level encryption when talking to binder for DNS? It'd be useful to talk about how something like dnscurve (for example) would fit into this, or why what it offers isn't needed and why what we're talking about with TSIG is sufficient.
In section 4.6, are the optical media lifetimes for normal consumer-burned discs or for archival ones?
In section 5 we talk about headnode services. It may be clearer to use something like core services, or to explicitly talk about services that run on trusted CNs. Similarly, given that we're going to have servers that run on non-trusted CNs, we probably need to talk about them here as well.
Similarly, I think we should expand more on when we're generating and using TLS client certificates or when we're using HTTP signed requests.
In section 6.1, point 4b how does registrar connect to binder and how does it verify that binder?
In section 6.1, it seems like CloudAPI will never use a TLS client cert here and will only ever use the SSH key. Is that right?
In section 12.1 we talk about using ChaCha20 and Poly1305 to encrypt files on disk. Is there a path to allowing us to change that in the future if we need to for some reason?
In section 14.2, if we have hardware available that supports, say, ed25519, will we be able to have a DC whose hardware spans different generations of supported algorithms and leverage them?
In section 14.3, if we want to offer new algorithms, will we be able to?
Finally, I think there are some things we should talk about that are a bit more forward looking. It may not be something that we need to talk about explicitly in this RFD, but something we should probably be talking through:
Can we leverage this for cross-DC configurations where we're in the same UFDS? This might be used in the future with respect to IKE or other things that we want to figure out how to authenticate across DCs.
How does Manta fit in here? First, this is in the sense of how does Manta's use of the admin VLAN work here? Secondly, say Manta wanted to use this to bootstrap an untrusted Manta network. How would that work?
@timkordas:

> In section 5, it is assumed that services will have a mechanism for refreshing their TLS certificates, and each instance of such services will be refreshing at least once per minute.
>
> do all of the services with these certificates have the ability to reload ? What happens to requests which "span certificates" (are there any services which need to be able to span more than one certificate refresh ?)
>
> What's the anticipated load on the CA-signing service that's signing these ? How many services will be issuing how many requests every minute.
Thanks for bringing this up. What I'm thinking is to actually extend the lifetime of some of the key certificates in the chain, since I do think this load is going to be too high with 60s everywhere. A quick summary of my proposal:
Key location | Signed identity | Signed by | Lifetime | Renew at | Positive vouching
---|---|---|---|---|---
Inside zone (key on disk) | Zone (with attribs) | Soft-token | 60s | (customer defined) |
Soft token (signing key) | Zone UUID | GZ soft-token signing key | 120s | +60s |
GZ soft-token (signing key) | CN UUID, intermediate | PIV signing key (9c) | 120s | +60s |
PIV signing key (9c) | CN UUID | CertAPI soft-token signing key | 24hr | +6hr | Yes, via CertAPI
CertAPI soft-token signing key | CertAPI zone (with attribs) | Headnode PIV signing key (9c) | 24hr | +6hr |
Headnode PIV signing key | Headnode UUID | Self-signed | Infinite | (never) | Yes, via gossip service
I'm using CertAPI here as a stand-in name for the headnode service that will sign these certificates.
As a general rule, we will refresh certificates halfway through their "Lifetime" value here, and we will never hand out a certificate that has less than half its lifetime remaining. This means that e.g. when a zone requests the entire chain from the soft-token, the chain may contain a certificate for the soft-token signing key that's up to 60s old (so has a minimum of 60s left on it), and a certificate for the CN that's up to 6hrs old (so has a minimum of 18hrs left on it). This rule ensures that the whole chain the zone gets will definitely last for at least 60s, and that there is enough time for renewals to happen before the next signing event.
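To make the guarantee concrete, here's a small sketch of that rule (purely illustrative; the function names and chain representation are made up for this example, not part of any proposed API):

```python
# Sketch of the renewal rule: each cert is renewed partway through its
# lifetime (the "Renew at" column), so at hand-out time its age is bounded.
# The whole chain is only as good as its shortest remaining lifetime.

def min_remaining(lifetime_s, max_age_s):
    """Worst-case seconds left on a cert with the given lifetime and max age."""
    return lifetime_s - max_age_s

def chain_valid_for(chain):
    """chain: list of (lifetime_s, max_age_s) pairs, leaf first.
    Returns the guaranteed validity window (seconds) of the whole chain."""
    return min(lifetime - age for lifetime, age in chain)

# Worst case from the table: fresh leaf cert, soft-token and GZ soft-token
# signing keys up to 60s old, CN (PIV) cert up to 6hrs old.
chain = [
    (60, 0),                 # zone cert, freshly issued
    (120, 60),               # soft-token signing key
    (120, 60),               # GZ soft-token signing key
    (24 * 3600, 6 * 3600),   # PIV signing key (9c): min 18hrs left
]
print(chain_valid_for(chain))  # 60 -- the chain is good for at least 60s
```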
It also means that CertAPI can be down for up to 18 hours before certificate chains begin to expire. We can tweak this further, of course. CertAPI will be closely coupled with CNAPI's data, but it should be HA-capable from day 1, so it might be difficult to make it part of the CNAPI zone. It'll likely require moray + manatee to be up in order to operate (so 2/3 of the trusted nodes in the DC), to give a rough sense of its availability requirements.
In terms of load fan-out, if we assume there are, say, 2000 zones on a machine (something we have as a goal but is not currently realistic), then this would imply a maximum signing load of ~33 sign operations per second for the GZ soft-token, and 1 sign operation per 60s for the hardware token. This seems very realistic -- the Yubikeys can do an RSA-2048 sign in about 500-600ms. This leaves plenty of room for them to spend time doing ECDH operations to unlock keys, which take about 60-70ms each.
If we assume there are 5000 CNs in a datacenter, then CertAPI will have to authenticate requests for, and then sign, 1 certificate roughly every 5 sec over the network, which also seems fine. There are not many CertAPI instances, so the load generated by the headnodes signing the certs for CertAPI is fairly negligible.
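The fan-out arithmetic above can be checked directly (the zone and CN counts are the assumed figures from the text, not measurements):

```python
# Back-of-the-envelope check of the signing load numbers.

# GZ soft-token: 2000 zones per CN, each soft-token signing key renewed
# every 60s (120s lifetime, renew at +60s).
zones_per_cn = 2000
zone_renew_interval_s = 60
gz_softtoken_ops_per_s = zones_per_cn / zone_renew_interval_s
print(round(gz_softtoken_ops_per_s, 1))  # 33.3 sign ops/sec

# CertAPI: 5000 CNs, each PIV cert (24hr lifetime) renewed every 6hrs.
cns = 5000
cn_renew_interval_s = 6 * 3600
certapi_interval_s = cn_renew_interval_s / cns
print(round(certapi_interval_s, 2))  # 4.32 -- one signing every ~4-5s
```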
I've also noted in the table where positive vouching is possible. Positive vouching will always be used for the headnode root CA certs (via the gossip service). Positive vouching can optionally also be used for the long-duration certs in the chain, by connecting to CertAPI (or cloudapi from outside, on a customer network) and confirming the status of the given CN. I expect this capability could be used for authenticating particularly sensitive or high-impact operations in APIs, for example.
If that seems sensible, I'll incorporate it into the writeup (and clarify the structure of the cert chain more in that section too). I've been talking to ops about this and they seem to want a minimum of ~18hr to react to an outage in a core service before customer workloads start dropping, though they'd like to see this be tweakable. So I guess take the values in the table as defaults.
[this comment took several edits to get right]
What you write above seems sensible to me.
The only related issue I can think of at the moment is that it would be nice to have the longer-lived certificate expirations "smeared"/jittered, such that they don't all expire at 0:00 UTC or whatever. If the initial generation is done all at once, then most certs will tend to expire around the same time; it'd be nice to spread the +6hr renewal load around the clock relatively evenly. Perhaps this means issuing the initial 24-hour certificate with an expiry of 24 + (Math.random() * 24) hours, or something similar.
That is, you want to get into a state where you are doing some number of certs every minute; not a situation where on average you are doing that number (all in minute zero at the top of the hour, then idle for the rest of the hour).
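As a sketch of that jitter idea (illustrative only, not a proposed implementation; the function name is made up here): the *initial* 24-hour certificate gets a random extra validity of up to 24 hours, so that once steady-state renewals kick in they're smeared across the day rather than clustered at the first-issuance time.

```python
import random

def initial_expiry_hours(base_hours=24.0):
    # First cert lives between base_hours and 2*base_hours; subsequent
    # renewals would go back to the normal base_hours lifetime, but the
    # renewal times are now spread around the clock.
    return base_hours + random.random() * base_hours
```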
We need some details around CN setup. Presumably that is something like:
`sdc-server setup zfs_encryption=true <mumble>`
Presumably we want to be able to indicate which keys are allowed for recovery. I'd expect that we don't want to set that up for each CN - maybe there's a DC-wide set of keys that are used and all CNs should converge on that. If it doesn't apply to all CNs, we probably need some way to attach a profile that includes the recovery public keys and their names. Key rotation should take advantage of centralized config.
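To illustrate the shape such a DC-wide profile might take (this schema and all names in it are purely hypothetical, not an existing SAPI/CNAPI structure):

```json
{
    "name": "default-recovery-profile",
    "applies_to": "all-cns",
    "recovery_keys": [
        { "name": "ops-primary", "pubkey": "..." },
        { "name": "ops-offline", "pubkey": "..." }
    ]
}
```

CNs without an explicit profile attached would converge on the DC-wide default, and rotating a recovery key would mean updating this one centrally-stored object rather than touching each CN.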
RFD 77 has been advanced to draft status. This is a place where discussion issues about it can be raised.