coder / internal

Non-community issues related to coder/coder
2 stars 0 forks source link

feat: automatic key rotation #52

Open sreya opened 3 weeks ago

sreya commented 3 weeks ago

Problem

We have a few symmetric keys that we use for signing (and also sometimes encrypting) various payloads that don't ever get rotated after creation. We've already encountered some friction with our more security conscious customers concerning our External Provisioners and pre shared keys...and the only reason why we haven't had more pushback on our symmetric key usage is because they are simply unaware of what is happening under the hood.

We already have three features (workspace apps, peer reconnection tokens, and a key used to convert built-in users to oauth) that require key signing and it's possible we may introduce more in the future. We should take the initiative while the debt is somewhat low and implement a system for rotating these internal keys.

Proposal

We will implement a rotation schedule -- configurable by the user -- where keys will be rotated based on an expiration. We should start with a single value that dictates the schedule for all keys. Monthly will be the default. We will spawn a process on startup that checks on same cadence (every 10 minutes?) to see if any keys need to be rotated. If an active key is within 1 hour of its expiration we will create a new key and set it starts_at equivalent to the expiration of the old key.

Implementation Notes

The following are the various token durations for our current signing keys:

Schema Updates

Right now keys are part of the site_config. I propose that we migrate them into their own proper table. The table will be called keys with the following columns.

feature (text) sequence (integer) secret (text) starts_at (timestamptz) deletes_at (timestamptz)

Where the Primary Key is (feature, sequence).

The starts_at column is a bit strange, but since we will be creating keys an hour ahead of time we should avoid using the newer keys until they've been properly propagated.

Considerations

High Availability

The query to insert new keys needs to take HA deployments into consideration. As a result we will use the RepeatableRead isolation level along with some row locking.

Workspace Proxies

We will refetch keys by leveraging our existing RegisterWorkspaceProxyLoop. The loop runs every 15s by default so 1 hour is more than sufficient to ensure proper propagation.

Other Requirements

Implementation

spikecurtis commented 2 weeks ago
deansheather commented 2 weeks ago
spikecurtis commented 2 weeks ago

Why even delete keys? Once expired we should just keep them forever, there aren't going to be that many, and there's currently no reason to delete them since they're only used for signing anyways. If we want a key to expire sooner than originally intended, the expiry date can just be updated to be sooner. If we start using this table for encryption keys then we can add deletion and cronjob to clean stuff up etc. In the event that the key is leaked the admin can delete it like Spike said.

If an attacker gains access to an old ~key~ JWT, they could attempt an attack where they confuse the server's time settings, e.g. by attacking NTP (network time protocol). If we delete old keys this closes down that attack surface.

EDIT: If an attacker has an old key, they could forge new JWTs, and if we delete old keys, that shuts down this attack surface.

spikecurtis commented 2 weeks ago

The UUID should be dropped in favor of the primary key being (feature, seq)

We still need a keyID for the JWT itself, and a UUID is fairly standard.

We could use the (feature, seq) for the keyID, but it's not globally unique, meaning you could get confusing results if the active key is deleted or JWTs cross from one Coder deployment to another. Shouldn't affect security, as the JWT wouldn't validate, but you'd see "signature validation failed" type errors, rather than "unknown key ID".

sreya commented 2 weeks ago

I'd rather keep a simple UUID that we use for the kid header, makes the lookup really straightforward

deansheather commented 2 weeks ago

The benefit of not using a UUID is that the JWTs get smaller. These JWTs are being used in cookies and query parameters where length does matter so it makes sense to try and remove bytes where we can if it's not difficult to do so.

We could use the (feature, seq) for the keyID

Why not just seq? If the sequence is for the whole table then there shouldn't be a problem only using the sequence number as the key ID.

Shouldn't affect security, as the JWT wouldn't validate, but you'd see "signature validation failed" type errors, rather than "unknown key ID".

I don't think we'd even return errors like this anyways, so maybe the log message in this very specific and rare situation would be slightly weird but I don't think that's a big deal.

I'd rather keep a simple UUID that we use for the kid header, makes the lookup really straightforward

There's that much of a difference between SELECT * FROM keys WHERE id = 'uuid' and SELECT * FROM keys WHERE feature = 'something' and seq = 1

deansheather commented 2 weeks ago

oops

sreya commented 2 weeks ago

@spikecurtis @deansheather I edited the post to incorporate a lot of y'all's feedback if you wouldn't mind taking another look.

Why not just seq? If the sequence is for the whole table then there shouldn't be a problem only using the sequence number as the key ID.

If we're going to have a seq column then it'd be odd if it wasn't representative of the key it was tracking...otherwise why not just call it id and use a SERIAL variant?

The benefit of not using a UUID is that the JWTs get smaller. These JWTs are being used in cookies and query parameters where length does matter so it makes sense to try and remove bytes where we can if it's not difficult to do so.

While I agree that you're technically correct we're ultimately talking about ~40 bytes. Do we think that's worth taking into consideration here? I'm ok making the change if we do it just seems pretty minor.

spikecurtis commented 2 weeks ago

Using a sequence number as the database primary key / JWT key ID would be fine. Biggest arguments for UUID is consistency with other tables, biggest argument for sequence number is smaller JWT --- neither is a huge difference, but hearing it out I now have a slight preference for sequence number.

We should not use SERIAL for the sequence number: it has to be computed client-side in a transaction otherwise we don't get the property that it prevents multiple coder servers inserting a new key at the same time.

sreya commented 2 weeks ago

@spikecurtis @deansheather Ok sounds good so are we in agreement to make the primary key (feature, sequence) or are we having a id of type integer instead that we increment? Are we otherwise 👍 on the implementation?

spikecurtis commented 2 weeks ago

deletes_at will be populated when a key is within 1 hour of expiration. It is defined as starts_at + key_duration + token_duration + 1h

I'd amend this slightly to be:

deletes_at will be populated when a new key is inserted for the feature. It is defined as starts_at from the newest key + token_duration + 1h

It's the same idea when things are working, but we don't want deletes_at to be populated if we fail to insert the newly rotated key for some reason, even if the old key is now within 1 hour of "expiration".

spikecurtis commented 2 weeks ago
  • Keys are valid for signing if starts_at < now() < expires_at. This means there should only ever be 1 key active for signing.

Again, that should be true if rotation is successful, but I don't think we should encode the business logic this way. We should always sign with the highest numbered sequence key where starts_at <= now().

  • Keys are valid for verifying if starts_at < now() < deletes_at. It's possible during rotation periods that there are 2 active keys for verification.

Any key where deletes_at == NULL or now() < deletes_at is fine for validation. We shouldn't enforce based on starts_at because we can't assume all servers have exactly the same clock. So, a different server with a slightly faster clock than us could legitimately sign a token with a new key even before we think it's ok to sign with that key.

sreya commented 2 weeks ago

Again, that should be true if rotation is successful, but I don't think we should encode the business logic this way. We should always sign with the highest numbered sequence key where starts_at <= now().

That's true, revised.

Edit: Actually revised now, I'm not sure how it didn't save my edits from before 🤔

sreya commented 2 weeks ago

@deansheather @spikecurtis do you have any preference on which JWT library we use going forward?

deansheather commented 2 weeks ago

IDM what library we use as long as it can do both encrypted and signed tokens. I'd go by API on what you think is better

spikecurtis commented 2 weeks ago

I like the JWT library because it handles creating and validating the industry-standard claims like nbf, exp in addition to signing/verifying signatures/encryption; the JOSE library only handles signing/verifying signatures/encryption.

sreya commented 1 week ago

@deansheather @spikecurtis The JWT library does not handle encryption and the companion library it references doesn't look like something we should depend on, so I think we should go with jose. It has significantly less stars but that's mainly because it was moved from square's org and the migrated repo does seem to be maintained. Thoughts?

spikecurtis commented 1 week ago

We're not encrypting any JWTs, right?

sreya commented 1 week ago

We're encrypting the API key we're smuggling to the wsproxy