Store token hashes instead of tokens

@jbr's comment in https://github.com/divviup/divviup-api/issues/195 made me think that we should change how Janus stores aggregator and collector authentication tokens so that we only store a cryptographic hash of the token instead of the actual token. This is a bit tricky to do, because currently we use the aggregator_auth_tokens table to store both tokens that our leader would present to a helper (in which case we need to store the actual token) as well as tokens that our helper would use to authenticate requests from another leader (in which case we only need the hash). So we would have to enrich the representation of tokens in Janus to distinguish between tokens we present and tokens we accept.

@jbr points out that if Janus helpers no longer store the actual aggregator and collector auth tokens, then Janus/Divvi Up need a story for allowing subscribers to recover from losing a collector auth token or allowing a peer aggregator to recover from losing an aggregator auth token, without forcing the rotation of a task, which could be very expensive if it has been deployed to many clients.

One solution strategy would be for Janus' aggregator API to expose some endpoints for rotating the auth tokens associated with a task. Then, if a subscriber loses access to a token, they can revoke the old one and mint a new one.

I am pretty well convinced by @jbr's point about the need to enable recovery from losing a collector authentication token. To that end, I think the Janus aggregator API should support rotation of collector authentication tokens. If we go do that, I think we may as well do it for aggregator auth tokens, too, because they'll mostly work the same.

The big open question here is how much disruption do we want to take on here. Storing collector auth token hashes instead of the unhashed token forces changes on the Divvi Up subscriber, because the existing GET /tasks/:task_id/collector_auth_tokens route on divviup-api can't work anymore (Janus will no longer have the token in the clear and so can't service that request). If we're going that far, should we take on the notion of "collector credentials" we discussed previously (https://github.com/divviup/divviup-api/issues/487), and make it the subscriber's responsibility to generate collector auth tokens and then provide divvivup-api the HPKE config + collector auth token hash?

I'm going to proceed on the assumption that we do not want to introduce collector credentials, as I think it can be added after the changes I'm discussing here.

Collector auth tokens

These are used by a Janus leader to authenticate incoming collection protocol requests. There may be many of these for each task. Any non-revoked token is valid.

API routes

Janus aggregator API

We add a resource for collector auth tokens with the following routes. Collector auth tokens are uniquely identified by a UUID (they're also uniquely identified by an integer primary key but that is never exposed outside of Janus). These are only usable if the aggregator's role in the task is leader. If not, requests fail with 400 Bad Request.

GET /tasks/:task_id/collector_auth_tokens returns a list of active collector auth token IDs.

POST /tasks/:task_id/collector_auth_tokens instructs the leader to generate a new collector auth token and assign it an ID. The leader stores the hash of the token its its database. The response is 201 CREATED consists of the unhashed token and its ID.

This is the only way to create collector auth tokens (see discussion of POST /tasks, below).

Note: If we choose to do collector credentials, then the leader would no longer be responsible for generating collector auth tokens. Instead, this request would have a body containing the auth token hash.

DELETE /tasks/:task_id/collector_auth_tokens/:token_id instructs the leader to revoke the specified collector auth token. The response is 204 No Content.

QUESTION: is PATCH better? QUESTION: Should revocation be immediate, or delayed by some interval to allow for in-flight requests to resolve?

POST /tasks will no longer generate a collector auth token. That must be done after creating the task with a request to POST /tasks/:task_id/collector_auth_tokens. Accordingly we remove collector_auth_token from TaskResp.

This is for two reasons:

better reflects how divviup-api uses this anyway, since it has distinct API routes for creating tasks and obtaining the collector auth token, with the latter only being called when the user clicks "show me the token" in the UI.
auth token creation only happens in one place, which is easier to reason about.

Note: If we choose to collector credentials, then this decision would change.

`divviup-api`

GET /tasks/:task_id/collector_auth_tokens calls the corresponding endpoint on the Janus aggregator API and returns the list of token IDs. Since Janus no longer stores unhashed auht tokens, it can't return the actual tokens anymore.

POST /tasks/:task_id/collector_auth_tokens calls the corresponding endpoint on the Janus aggregator API and returns the auth token and its ID.

DELETE /tasks/:task_id/collector_auth_tokens/:token_id calls the corresponding endpoint on the Janus aggregator API.

Database

In the Janus database:

-- The collector authentication tokens used by a given task.
CREATE TABLE task_collector_auth_tokens(
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- artificial ID, internal-only
    token_id UUID NOT NULL,                            -- external unique ID for the token
    task_id BIGINT NOT NULL,                           -- task ID the token is associated with
    ord BIGINT NOT NULL,                               -- a value used to specify the ordering of the authentication tokens
    type AUTH_TOKEN_TYPE NOT NULL DEFAULT 'BEARER',    -- the type of the authentication token
    token_hash BYTEA NOT NULL,                         -- SHA-256 of the token
    created_at TIMESTAMP NOT NULL,                     -- when the token was created
    revoked_at TIMESTAMP,                              -- when the token was revoked. If NULL, token is still valid.

    CONSTRAINT task_collector_auth_tokens_unique_task_id_and_ord UNIQUE(task_id, ord),
    CONSTRAINT fk_task_id FOREIGN KEY(task_id) REFERENCES tasks(id) ON DELETE CASCADE
);

Changes from the existing representation:

adds token_id UUID to store external unique ID
column token is now token_hash as we now only store token hashes
new columns created_at, revoked_at

Aggregator auth tokens

These are used by a Janus helper to authenticate incoming aggregation protocol requests, and by a Janus leader to authenticate outgoing aggregation protocol requests. In the former case, we store only token hashes. In the latter, we store unhashed tokens.

A helper may have many tokens for a task and any currently valid token may be used. A leader may also have many tokens but will only ever use the most recent one.

API routes

Janus aggregator API

We add a resource for aggregator auth tokens. Aggregator auth tokens are uniquely identified by a UUID, just like collector auth tokens. The resource is usable for either aggregator role, but with some differences in request bodies and responses.

GET /tasks/:task_id/aggregator_auth_tokens returns a list of active aggregator auth token IDs. It behaves the same for both roles.

On the Helper, POST /tasks/:task_id/aggregator_auth_tokens has no body and instructs the helper to generate a new aggregator auth token and assign it a unique ID. The token hash is stored in the database. The unhashed token and its ID are returned to the caller with status 201 Created.

On the Leader, POST /tasks/:task_id/aggregator_auth_tokens has a body consisting of the unhashed token. The leader stores the unhashed token in its database and returns status 201 Created with no body.

QUESTION: the request could also include a token ID, which would allow divviup-api to ensure token IDs are consistent across leader and helper. That could help with debugging.

DELETE /tasks/:task_id/aggregator_auth_tokens/:token_id instructs the aggregator to revoke the specified auth token. Returns 204 No Content.

`divviup-api`

POST /tasks/:task_id/aggregator_auth_token instructs the control plane to coordinate rotation of the aggregator auth token across the leader and helper. The request has no body. Upon receiving this request, divviup-api will:

Do GET /tasks/:task_id/aggregator_auth_tokens on both aggregators to discover active auth tokens;
Do POST /tasks/:task_id/aggregator_auth_tokens on the helper to generate a new aggregator auth token;
Do POST /tasks/:task_id/aggregator_auth_tokens on the leader to provide it the token just generated by the helper;
Do DELETE /tasks/:task_id/aggregator_auth_tokens/:token_id on any still active toksn in either aggregator

Database

-- The aggregator authentication tokens used by a given task.
CREATE TABLE task_aggregator_auth_tokens(
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- artificial ID, internal-only
    token_id UUID NOT NULL,                            -- external unique ID for the token
    task_id BIGINT NOT NULL,                           -- task ID the token is associated with
    ord BIGINT NOT NULL,                               -- a value used to specify the ordering of the authentication tokens
    type AUTH_TOKEN_TYPE NOT NULL DEFAULT 'BEARER',    -- the type of the authentication token
    token BYTEA NOT NULL,                              -- for the leader, bearer token used to authenticate messages to/from the other aggregator (encrypted)
                                                       -- for the helper, SHA-256 of the token.
    created_at TIMESTAMP NOT NULL,                     -- when the token was created
    revoked_at TIMESTAMP,                              -- when the token was revoked. If NULL, token is still valid.

    CONSTRAINT task_aggregator_auth_tokens_unique_task_id_and_ord UNIQUE(task_id, ord),
    CONSTRAINT fk_task_id FOREIGN KEY(task_id) REFERENCES tasks(id) ON DELETE CASCADE
);

Changes from the existing representation:

adds token_id UUID to store external unique ID
column token can now store either an unhashed auth token OR the SHA-256 hash of a token, depending on the role played in the task by this aggregator
new columns created_at, revoked_at

Note: It's not great that column token's contents are different based on the role. I considered having two nullable columns token and token_hash, where only one or the other would be set based on task role, or adding a boolean column indicating what the token column contains. However, those solutions admit representations of illegal states (e.g., if the role in the task is helper, then we'd expect only token_hash to be non-null, but the implementation has to check that).

Some questions on context, apologies if I missed it in slack or other issues.

What's the threat model and/or improvements that justify token hashing? Not saying that is is an invalid thing to do, it's moreso me wanting to avoid making assumptions--it would also help us evaluate how much disruption or UX compromise is acceptable.

If we're going that far, should we take on the notion of "collector credentials" we discussed previously (https://github.com/divviup/divviup-api/issues/487), and make it the subscriber's responsibility to generate collector auth tokens and then provide divvivup-api the HPKE config + collector auth token hash?

As long as the UX is such that the subscriber does not have to juggle two tokens. To that end perhaps we should adopt https://github.com/divviup/divviup-api/issues/455?

QUESTION: is PATCH better?

For deletion, I don't think so. What would the caller call PATCH with to communicate that the token should be revoked? They can't do revoked_at because it would would just be ignored (we might be tempted to allow an expireable token, but with task expiration I'm not sure this makes sense).

Should revocation be immediate, or delayed by some interval to allow for in-flight requests to resolve?

IMO revocation should be immediate because the worst case scenario that justifies token revocation is that the token has been leaked and is being exploited.

What requests would be in-flight? I can only think of collector polling. It would indeed cut them off--but an operator doing a controlled token rotation will always create a new token, apply it to their environment (presumably a collector restart would be graceful), then delete the old one so I don't think this is a real risk.

the request could also include a token ID, which would allow divviup-api to ensure token IDs are consistent across leader and helper. That could help with debugging.

Yes!

Note that token rotation is complicated by the fact that Janus caches task data indefinitely. https://github.com/divviup/janus/issues/238. Perhaps highly complicated.

Database questions: Why not have the token_id be the primary key and omit id? (I think you would need an index on token_id anyway, but I'm not sure off-hand what EXPLAIN ANALYZE would say about looking up tokens)

Presumably the token hash will be stored in plaintext in the database? I don't think there's any value in encrypting a hash. For any values that are stored in plaintext, I think the ord parameter becomes unnecessary? IIRC we only use it for the AAD during encryption, although I defer to other's judgement on that.

However, those solutions admit representations of illegal states (e.g., if the role in the task is helper, then we'd expect only token_hash to be non-null, but the implementation has to check that).

IMO I think we should have token and token_hash--overloading columns can be confusing. I believe we can use a check constraint to deny an invalid state, or worst case can use a trigger. I'd need to read the manual and play around a bit to be sure though. It's debatable whether we want to encode that kind of business logic into the database though.

This whole proposal is obsolete because DAP forbids rotating collector HPKE configs associated with a task, and I don't wish to revisit that decision.

Answering questions that are still relevant given that we're not doing rotation:

What's the threat model and/or improvements that justify token hashing?

Good point and well worth articulating. We don't want an attacker with read access to the Janus or divviup-api databases to be able to exfiltrate a credential and subsequently impersonate either the DAP leader or collector. If we store a hash of the auth token, then a request recipient can still authenticate messages. Naturally the credential still gets exposed to the recipient because it'll appear in HTTP headers, but those only exist in memory and not in a persistent database.

As long as the UX is such that the subscriber does not have to juggle two tokens. To that end perhaps we should adopt divviup/divviup-api#455?

Yeah, I believe we've agreed in Slack to make headway toward the reusable collector credential.

the request could also include a token ID, which would allow divviup-api to ensure token IDs are consistent across leader and helper. That could help with debugging.

Yes!

👍🏻

Database questions: Why not have the token_id be the primary key and omit id? (I think you would need an index on token_id anyway, but I'm not sure off-hand what EXPLAIN ANALYZE would say about looking up tokens)

I don't have a good source handy just now for why this is a good practice but if nothing else we should do it this way for consistency with other tables. If we want to revisit whether to get rid of autoincrementing integer primary keys in Janus tables, we should do that separately from this discussion.

Presumably the token hash will be stored in plaintext in the database? I don't think there's any value in encrypting a hash. For any values that are stored in plaintext, I think the ord parameter becomes unnecessary? IIRC we only use it for the AAD during encryption, although I defer to other's judgement on that.

Yes, the token hashes don't need to be stored encrypted. I'm not sure whether the ord parameter can be omitted.

IMO I think we should have token and token_hash--overloading columns can be confusing. I believe we can use a check constraint to deny an invalid state, or worst case can use a trigger. I'd need to read the manual and play around a bit to be sure though. It's debatable whether we want to encode that kind of business logic into the database though.

Good point, we can likely write a constraint that enforces this safely.

I think one important thing that's emerged from today's discussions in Slack is that we will not be supporting rotation of task-associated secrets like HPKE configs and auth tokens. That means that as argued in #1521, we should collapse the auth token tables into tasks.

I believe we have all the Janus changes we want in place, but I'm going to keep this open until we prove that this works e2e with divviup-api, as the changes on that side have not landed yet.

We have https://github.com/divviup/divviup-api/issues/542 to track divviup-api changes, and bugfixes to resolve impedance mismatches can follow in later issues/PRs. Closing this.

divviup / janus