Open sreya opened 3 weeks ago
- `signing_keys` name would preclude using keys for encryption at a later date. Can we just call it `keys`?
- `value` column should be named something like `secret` so that it's obvious that it shouldn't be logged, and also allows us to extend this table to asymmetric keys in future
- [Drop] `expires_at` as a column, and compute the time to rotate based on `starts_at` and the rotation policy. That way, if you change the policy to be shorter, you get rotation right away. E.g. if I have rotation every month and then change it to every week, I shouldn't wait until the original month is up before rotating.
- [Add a] `deletes_at` time. When it's time to rotate:
  - [insert the new key with] `deletes_at=NULL`
  - [set the old key's] `deletes_at` to current time + longest token expiry + 1 hour (e.g. WorkspaceApps keys will delete 65 minutes after they are rotated out)
  - [delete keys with] `deletes_at` in the past
- [Add a] `sequence_number`, and a unique constraint on `sequence_number` and `feature`. In a transaction, we increment the `sequence_number` by one each time we rotate: this prevents multiple Coder servers from rotating at the same time.
The UUID should be dropped in favor of the primary key being `(feature, seq)`
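The suggestion to compute the rotation time from `starts_at` and the current policy, rather than storing `expires_at`, can be sketched as below. This is only an illustration: `rotation_due` and its parameters are hypothetical names, not anything that exists in Coder.

```python
from datetime import datetime, timedelta, timezone


def rotation_due(starts_at: datetime, rotation_period: timedelta, now: datetime) -> bool:
    """Return True once the key has been active for at least one rotation period."""
    return now >= starts_at + rotation_period


starts_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = starts_at + timedelta(days=10)

# Under a monthly policy the 10-day-old key is not yet due.
print(rotation_due(starts_at, timedelta(days=30), now))  # False

# Switching to a weekly policy makes it due right away, because the
# old policy was never baked into the row as an expiry timestamp.
print(rotation_due(starts_at, timedelta(days=7), now))  # True
```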
Why even delete keys? Once expired we should just keep them forever, there aren't going to be that many, and there's currently no reason to delete them since they're only used for signing anyways. If we want a key to expire sooner than originally intended, the expiry date can just be updated to be sooner. If we start using this table for encryption keys then we can add deletion and cronjob to clean stuff up etc. In the event that the key is leaked the admin can delete it like Spike said.
If an attacker gains access to an old ~key~ JWT, they could attempt an attack where they confuse the server's time settings, e.g. by attacking NTP (network time protocol). If we delete old keys this closes down that attack surface.
EDIT: If an attacker has an old key, they could forge new JWTs, and if we delete old keys, that shuts down this attack surface.
> The UUID should be dropped in favor of the primary key being (feature, seq)
We still need a keyID for the JWT itself, and a UUID is fairly standard.
We could use the (feature, seq) for the keyID, but it's not globally unique, meaning you could get confusing results if the active key is deleted or JWTs cross from one Coder deployment to another. Shouldn't affect security, as the JWT wouldn't validate, but you'd see "signature validation failed" type errors, rather than "unknown key ID".
I'd rather keep a simple UUID that we use for the `kid` header, makes the lookup really straightforward
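To make the "unknown key ID" versus "signature validation failed" distinction concrete, here is a rough stdlib-only sketch of an HS256 token with a `kid` header. It is not the JWT or JOSE library API; `sign`, `verify`, the exception names and the example secrets are all made up for illustration, and a real verifier would also check `alg` and the time claims.

```python
import base64
import hashlib
import hmac
import json


class UnknownKeyID(Exception):
    pass


class InvalidSignature(Exception):
    pass


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(kid: str, secret: bytes, claims: dict) -> str:
    header = {"alg": "HS256", "kid": kid}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify(token: str, keys: dict) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    kid = json.loads(b64url_decode(header_b64))["kid"]
    secret = keys.get(kid)  # the kid header drives the key lookup
    if secret is None:
        raise UnknownKeyID(kid)
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise InvalidSignature()
    return json.loads(b64url_decode(payload_b64))


# Two deployments that both happen to have a key with sequence number 1.
token = sign("1", b"secret-of-deployment-a", {"sub": "user"})
try:
    verify(token, {"1": b"secret-of-deployment-b"})
except InvalidSignature:
    print("signature validation failed")

# With globally unique IDs the same mix-up surfaces as an unknown key.
token = sign("3f0c5a2e-made-up-id", b"secret-of-deployment-a", {"sub": "user"})
try:
    verify(token, {"9b1d7c44-another-id": b"secret-of-deployment-b"})
except UnknownKeyID:
    print("unknown key ID")
```

Either way the token is rejected; only the error that gets reported differs.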
The benefit of not using a UUID is that the JWTs get smaller. These JWTs are being used in cookies and query parameters where length does matter so it makes sense to try and remove bytes where we can if it's not difficult to do so.
> We could use the (feature, seq) for the keyID
Why not just seq? If the sequence is for the whole table then there shouldn't be a problem only using the sequence number as the key ID.
> Shouldn't affect security, as the JWT wouldn't validate, but you'd see "signature validation failed" type errors, rather than "unknown key ID".
I don't think we'd even return errors like this anyways, so maybe the log message in this very specific and rare situation would be slightly weird but I don't think that's a big deal.
> I'd rather keep a simple UUID that we use for the kid header, makes the lookup really straightforward
Is there really that much of a difference between `SELECT * FROM keys WHERE id = 'uuid'` and `SELECT * FROM keys WHERE feature = 'something' AND seq = 1`?
oops
@spikecurtis @deansheather I edited the post to incorporate a lot of y'all's feedback if you wouldn't mind taking another look.
> Why not just seq? If the sequence is for the whole table then there shouldn't be a problem only using the sequence number as the key ID.
If we're going to have a `seq` column then it'd be odd if it wasn't representative of the key it was tracking... otherwise why not just call it `id` and use a `SERIAL` variant?
> The benefit of not using a UUID is that the JWTs get smaller. These JWTs are being used in cookies and query parameters where length does matter so it makes sense to try and remove bytes where we can if it's not difficult to do so.
While I agree that you're technically correct, we're ultimately talking about ~40 bytes. Do we think that's worth taking into consideration here? I'm OK making the change if we do; it just seems pretty minor.
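For a rough sense of the size difference, the snippet below compares the base64url-encoded length of a JWT header carrying a UUID `kid` against one carrying a short sequence number. The header layout (`alg` plus `kid`, HS256, compact JSON) is an assumption for the sake of the comparison; the real tokens may carry other header fields.

```python
import base64
import json
import uuid


def encoded_header_len(kid: str) -> int:
    """Length of the base64url-encoded JWT header for a given key ID."""
    header = json.dumps({"alg": "HS256", "kid": kid}, separators=(",", ":")).encode()
    return len(base64.urlsafe_b64encode(header).rstrip(b"="))


uuid_len = encoded_header_len(str(uuid.uuid4()))
seq_len = encoded_header_len("42")
print(f"uuid kid: {uuid_len} chars, sequence kid: {seq_len} chars, saved: {uuid_len - seq_len}")
```

The saving comes entirely from the header, so it is the same for every token regardless of payload size.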
Using a sequence number as the database primary key / JWT key ID would be fine. The biggest argument for UUID is consistency with other tables; the biggest argument for sequence number is a smaller JWT. Neither is a huge difference, but hearing it out I now have a slight preference for sequence number.
We should not use `SERIAL` for the sequence number: it has to be computed client-side in a transaction, otherwise we don't get the property that it prevents multiple Coder servers inserting a new key at the same time.
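The property described here can be illustrated with a small simulation. It uses an in-memory SQLite database rather than Postgres, and a single connection standing in for two Coder servers, so it only demonstrates the uniqueness argument, not real concurrency. The table layout and helper names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE keys (
        feature  TEXT    NOT NULL,
        sequence INTEGER NOT NULL,
        secret   TEXT,
        PRIMARY KEY (feature, sequence)
    )
    """
)
conn.execute("INSERT INTO keys VALUES ('workspace_apps', 1, 'old-secret')")
conn.commit()


def latest_sequence(feature: str) -> int:
    row = conn.execute(
        "SELECT MAX(sequence) FROM keys WHERE feature = ?", (feature,)
    ).fetchone()
    return row[0]


def insert_key(feature: str, sequence: int, secret: str) -> bool:
    """Try to insert a key; return False if that sequence number is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO keys VALUES (?, ?, ?)", (feature, sequence, secret))
        return True
    except sqlite3.IntegrityError:
        return False


# Both "servers" decide to rotate after reading the same latest sequence.
seen_by_a = latest_sequence("workspace_apps")
seen_by_b = latest_sequence("workspace_apps")

a_won = insert_key("workspace_apps", seen_by_a + 1, "secret-from-a")
b_won = insert_key("workspace_apps", seen_by_b + 1, "secret-from-b")
print(a_won, b_won)  # True False
```

With a database-assigned serial, the two inserts would have received sequence numbers 2 and 3 and both would have succeeded, which is exactly the double rotation the client-side increment is meant to prevent.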
@spikecurtis @deansheather Ok sounds good, so are we in agreement to make the primary key `(feature, sequence)`, or are we having an `id` of type `integer` instead that we increment? Are we otherwise 👍 on the implementation?
> deletes_at will be populated when a key is within 1 hour of expiration. It is defined as starts_at + key_duration + token_duration + 1h
I'd amend this slightly to be:
`deletes_at` will be populated when a new key is inserted for the feature. It is defined as `starts_at` from the newest key + `token_duration` + `1h`
It's the same idea when things are working, but we don't want `deletes_at` to be populated if we fail to insert the newly rotated key for some reason, even if the old key is now within 1 hour of "expiration".
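A minimal sketch of that ordering, assuming keys are plain dicts and `insert` is whatever callable persists the new key (both hypothetical). The 5-minute token duration is only inferred from the "65 minutes" WorkspaceApps example earlier in the thread.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=1)


def rotate(old_key: dict, new_key: dict, token_duration: timedelta, insert) -> None:
    """Insert the new key first; only then schedule the old key for deletion."""
    insert(new_key)  # if this raises, the old key's deletes_at is left untouched
    old_key["deletes_at"] = new_key["starts_at"] + token_duration + GRACE_PERIOD


def failing_insert(key: dict) -> None:
    raise RuntimeError("database unavailable")


old = {"sequence": 1, "deletes_at": None}
new = {"sequence": 2, "starts_at": datetime(2024, 2, 1, tzinfo=timezone.utc)}

try:
    rotate(old, new, timedelta(minutes=5), failing_insert)
except RuntimeError:
    pass
print(old["deletes_at"])  # None: the failed rotation did not schedule a deletion

rotate(old, new, timedelta(minutes=5), lambda key: None)
print(old["deletes_at"])  # 2024-02-01 01:05:00+00:00
```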
> - Keys are valid for signing if starts_at < now() < expires_at. This means there should only ever be 1 key active for signing.
Again, that should be true if rotation is successful, but I don't think we should encode the business logic this way. We should always sign with the highest numbered sequence key where `starts_at <= now()`.
> - Keys are valid for verifying if starts_at < now() < deletes_at. It's possible during rotation periods that there are 2 active keys for verification.
Any key where `deletes_at == NULL` or `now() < deletes_at` is fine for validation. We shouldn't enforce based on `starts_at` because we can't assume all servers have exactly the same clock. So, a different server with a slightly faster clock than us could legitimately sign a token with a new key even before we think it's ok to sign with that key.
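The two selection rules suggested in this comment could look roughly like the following. The `Key` dataclass and function names are invented for the example; the real table has more columns.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Key:
    sequence: int
    starts_at: datetime
    deletes_at: Optional[datetime] = None


def signing_key(keys: list, now: datetime) -> Optional[Key]:
    """Highest-sequence key that has already started, per our own clock."""
    started = [k for k in keys if k.starts_at <= now]
    return max(started, key=lambda k: k.sequence, default=None)


def valid_for_verification(key: Key, now: datetime) -> bool:
    """starts_at is deliberately ignored so that clock skew between servers is tolerated."""
    return key.deletes_at is None or now < key.deletes_at


now = datetime(2024, 2, 1, tzinfo=timezone.utc)
old = Key(sequence=1, starts_at=now - timedelta(days=30), deletes_at=now + timedelta(minutes=65))
new = Key(sequence=2, starts_at=now + timedelta(seconds=30))  # not started yet on our clock

print(signing_key([old, new], now).sequence)  # 1
print([valid_for_verification(k, now) for k in (old, new)])  # [True, True]
```

A server whose clock runs 30 seconds ahead would already sign with key 2, and the second rule lets this server accept those tokens.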
> Again, that should be true if rotation is successful, but I don't think we should encode the business logic this way. We should always sign with the highest numbered sequence key where starts_at <= now().
That's true, revised.
Edit: Actually revised now, I'm not sure how it didn't save my edits from before 🤔
@deansheather @spikecurtis do you have any preference on which JWT library we use going forward?
IDM what library we use as long as it can do both encrypted and signed tokens. I'd go by API on what you think is better
I like the JWT library because it handles creating and validating the industry-standard claims like `nbf`, `exp` in addition to signing/verifying signatures/encryption; the JOSE library only handles signing/verifying signatures/encryption.
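For context, this is roughly the kind of `nbf`/`exp` check that would have to be written by hand if the chosen library does not validate claims. It is a generic sketch following the RFC 7519 definitions, not code from either library; `check_time_claims` and `leeway` are illustrative names.

```python
import time
from typing import Optional


def check_time_claims(claims: dict, now: Optional[float] = None, leeway: float = 0.0) -> None:
    """Raise ValueError if the token is not yet valid (nbf) or has expired (exp)."""
    if now is None:
        now = time.time()
    nbf = claims.get("nbf")
    if nbf is not None and now < nbf - leeway:
        raise ValueError("token is not valid yet (nbf)")
    exp = claims.get("exp")
    if exp is not None and now >= exp + leeway:
        raise ValueError("token has expired (exp)")


claims = {"nbf": 1_000, "exp": 1_060}
check_time_claims(claims, now=1_030)  # inside the validity window: no exception

for moment in (990, 1_060):
    try:
        check_time_claims(claims, now=moment)
    except ValueError as err:
        print(moment, err)
```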
@deansheather @spikecurtis The JWT library does not handle encryption and the companion library it references doesn't look like something we should depend on, so I think we should go with `jose`. It has significantly fewer stars, but that's mainly because it was moved from square's org, and the migrated repo does seem to be maintained. Thoughts?
We're not encrypting any JWTs, right?
We're encrypting the API key we're smuggling to the wsproxy
Problem
We have a few symmetric keys that we use for signing (and also sometimes encrypting) various payloads that don't ever get rotated after creation. We've already encountered some friction with our more security-conscious customers concerning our External Provisioners and pre-shared keys, and the only reason we haven't had more pushback on our symmetric key usage is that they are simply unaware of what is happening under the hood.
We already have three features (workspace apps, peer reconnection tokens, and a key used to convert built-in users to oauth) that require key signing and it's possible we may introduce more in the future. We should take the initiative while the debt is somewhat low and implement a system for rotating these internal keys.
Proposal
We will implement a rotation schedule, configurable by the user, where keys will be rotated based on an expiration. We should start with a single value that dictates the schedule for all keys. Monthly will be the default. We will spawn a process on startup that checks on some cadence (every 10 minutes?) to see if any keys need to be rotated. If an active key is within 1 hour of its expiration we will create a new key and set its `starts_at` equivalent to the expiration of the old key.
Implementation Notes
- [Keys expire at] `starts_at` + `key_duration`, where `key_duration` is a value provided at runtime by the user.
- `deletes_at` will be populated when a new key is inserted for the feature. It is defined as `starts_at` from the newest key + `token_duration` + `1h`.
- [Keys are valid for verifying if] `now()` < `deletes_at` or `deletes_at == NULL`.
- [Keys are valid for signing if] `starts_at` <= `now()` < `deletes_at`.
- [After] `deletes_at` we will set the `secret` field to NULL.
The following are the various token durations for our current signing keys:
Schema Updates
Right now keys are part of the `site_config`. I propose that we migrate them into their own proper table. The table will be called `keys` with the following columns.
Where the Primary Key is `(feature, sequence)`.
The `starts_at` column is a bit strange, but since we will be creating keys an hour ahead of time we should avoid using the newer keys until they've been properly propagated.
Considerations
High Availability
The query to insert new keys needs to take HA deployments into consideration. As a result we will use the `RepeatableRead` isolation level along with some row locking.
Workspace Proxies
We will refetch keys by leveraging our existing RegisterWorkspaceProxyLoop. The loop runs every 15s by default so 1 hour is more than sufficient to ensure proper propagation.
Other Requirements
dbcrypt
Implementation
- `coderd/keyrotate` package
- `coderd/keysigning` package
- `crypto_keys` table and implement remaining glue