divviup / janus

Experimental implementation of the Distributed Aggregation Protocol (DAP) specification.
Mozilla Public License 2.0

Drop support for per-task HPKE configs #2147

Open divergentdave opened 8 months ago

divergentdave commented 8 months ago

We are considering dropping support for per-task HPKE configs in order to simplify the implementation. Note that this will have an impact on a number of unit tests. Global HPKE configs can currently only be set up via the aggregator API, so we may need to make the aggregator API required instead of optional, provide another means to set up a global keypair, or set one up automatically on first boot.

inahga commented 8 months ago

Dupe of https://github.com/divviup/janus/issues/1641, I'll let this issue supersede that one.

inahga commented 1 month ago

We have decided to move forward with obsoleting per-task HPKE keys and fully adopting global HPKE keys.

inahga commented 1 month ago

Some notes: we already have support for global keys for the taskprov use case, so the functionality for using these keys should already be present. Janus also refreshes its internal cache of these keys every 30 minutes.

We'll need some tweaks to remove the if taskprov_enabled checks and to change the behavior of /hpke_configs to always present global keys. We will need to retain the ability for Janus to trial-decrypt using a task-specific key until the next breaking changes window. Task provisioning will need to be modified to stop creating task-specific keys, and several tests will need to be changed to stop using them.
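
To illustrate the trial-decryption fallback during the transition window, the keypair lookup could take roughly this shape (the types and names here are illustrative, not the actual Janus API):

use std::collections::HashMap;

type HpkeConfigId = u8; // simplified stand-in for the real config ID type
struct HpkeKeypair;     // key material elided

/// Prefer a global keypair matching the report's config ID, falling back to the
/// task-specific (legacy) keypair until the breaking changes window removes it.
fn keypair_for_report<'a>(
    global_keypairs: &'a HashMap<HpkeConfigId, HpkeKeypair>,
    task_keypairs: &'a HashMap<HpkeConfigId, HpkeKeypair>,
    config_id: HpkeConfigId,
) -> Option<&'a HpkeKeypair> {
    global_keypairs
        .get(&config_id)
        .or_else(|| task_keypairs.get(&config_id))
}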

The main missing piece is automatic provisioning: global keys currently have to be provisioned manually, which is a non-starter for normal deployments of Janus, of which we run several. We also need to consider that BYOH deployments probably don't want to have to worry about key management.

We need a subprocess key-rotator (to borrow terminology from ENPA) that ensures that an active HPKE key is present, and fully handles the lifecycle of keys. This can run independently for Janus leader use cases, and in the same binary for BYOH use cases. There is a first-start edge case that requires consideration, where an aggregator process starts running before the key-rotator does. A few options here, perhaps the aggregator's cache is always invalid until it has at least one active key.

inahga commented 1 month ago

> There is a first-start edge case that requires consideration, where an aggregator process starts running before the key-rotator does. A few options here, perhaps the aggregator's cache is always invalid until it has at least one active key.

This can be accomplished by having GlobalHpkeKeypairCache::new() retry fetching keypairs, with backoff, until it finds at least one active one. This will block the process from fully starting until the key rotator service inserts an active keypair.
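
A minimal sketch of that startup behavior (the fetch closure stands in for the datastore query; this is not the real cache code):

use std::{future::Future, time::Duration};

enum State { Pending, Active, Expired }
struct GlobalHpkeKeypair { state: State } // key material elided

/// Block until the datastore reports at least one active global keypair, retrying with
/// capped exponential backoff. The returned set of keypairs seeds the cache.
async fn wait_for_active_keypair<Fut>(fetch: impl Fn() -> Fut) -> Vec<GlobalHpkeKeypair>
where
    Fut: Future<Output = Vec<GlobalHpkeKeypair>>,
{
    let mut delay = Duration::from_millis(500);
    loop {
        let keypairs = fetch().await;
        if keypairs.iter().any(|kp| matches!(kp.state, State::Active)) {
            return keypairs;
        }
        // No active keypair yet: the key rotator hasn't finished its first sweep.
        tokio::time::sleep(delay).await;
        delay = (delay * 2).min(Duration::from_secs(30));
    }
}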

For BYOH use cases where we're supplying an all-in-one binary, the key rotator task should be started before the HPKE cache task. The key rotator should run a sweep immediately on creation.
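
For the all-in-one case, the startup ordering could look roughly like this (KeyRotator, Datastore, and the other names are placeholders for the sake of the sketch):

use std::{sync::Arc, time::Duration};

struct Datastore; // placeholder
struct KeyRotator { datastore: Arc<Datastore> }

impl KeyRotator {
    async fn sweep(&self) { /* ensure an active keypair exists, per the logic below */ }
    async fn run_forever(self, period: Duration) {
        let mut interval = tokio::time::interval(period);
        loop {
            interval.tick().await;
            self.sweep().await;
        }
    }
}

async fn run_all_in_one(datastore: Arc<Datastore>) {
    let key_rotator = KeyRotator { datastore: Arc::clone(&datastore) };
    // Run one sweep to completion first, so an active global keypair exists before
    // the HPKE cache is built or hpke_configs is served.
    key_rotator.sweep().await;
    // Then keep the rotator running on its configured interval in the background.
    tokio::spawn(key_rotator.run_forever(Duration::from_secs(60 * 60)));
    // Building the HPKE cache and starting the aggregator now finds an active key
    // immediately, without having to take the backoff path sketched above.
    // ... GlobalHpkeKeypairCache::new(...), then serve ...
}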

I didn't really like the "cache always invalid unless one key is active" strategy because it would make the cache logic more complicated.

inahga commented 1 month ago

I've sketched out the logic for the key rotator below (it may have errors; let me know if anything stands out as wrong).

Configuration:

struct Options {
    /// How often to run the key rotator, in seconds
    pub frequency_s: u64,
    pub hpke: HpkeOptions,
}

struct HpkeOptions {
    // (these can be named better)
    /// How long to wait before promoting a keypair from PENDING to ACTIVE. This should correlate to
    /// how often Janus updates its global HPKE cache, i.e. 30 minutes
    pub pending_to_active_s: u64,
    /// How long until an active key should be considered expired, i.e. the age of an HPKE key
    pub active_to_expired_s: u64,
    /// How long an expired key should be kept around. This should correlate to how long clients are
    /// expected to cache an HPKE key.
    pub expired_to_deleted_s: u64,

    /// A list of ciphersuites to ensure have an active HPKE keypair. Can be multiple, if we want
    /// to provide support for multiple ciphersuites, otherwise it can just be whatever we want.
    pub ciphers: Vec<HpkeCiphersuite>,
}

struct HpkeCiphersuite(KEM, KDF, AEAD);
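
For illustration, a configuration consistent with the constraints above might look like the following (the durations are example values, not recommendations, and the ciphersuite variant names are illustrative):

let options = Options {
    // Sweep hourly; a sweep is cheap when there is nothing to do.
    frequency_s: 60 * 60,
    hpke: HpkeOptions {
        // At least the aggregators' 30-minute cache refresh interval, so every replica
        // can decrypt under a new key before it is advertised as active.
        pending_to_active_s: 60 * 60,
        // Rotate active keys every 12 weeks.
        active_to_expired_s: 12 * 7 * 24 * 60 * 60,
        // Keep expired keys decryptable for a week of client-side caching.
        expired_to_deleted_s: 7 * 24 * 60 * 60,
        ciphers: vec![HpkeCiphersuite(KEM::X25519HkdfSha256, KDF::HkdfSha256, AEAD::Aes128Gcm)],
    },
};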

Key rotator logic:

In database transaction:
    Acquire ExclusiveLock on global_hpke_keys (n.b. this does not block readers)

    Get current keys
    For each HPKE ciphersuite:
        If there is no active key fitting the ciphersuite (bootstrapping logic):
            Insert an active key
        Else:
            If the active key is too old:
                If there is a pending key fitting the ciphersuite:
                    If the pending key has been pending long enough:
                        Make the pending key active
                        Expire the active key
                    Else do nothing (we'll have to wait longer before taking action)
                Else insert a pending key
            Else do nothing (the active key is fine)

    For each expired key:
        If the expired key is old enough, delete it.
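
Roughly the same decision written as Rust, for a single ciphersuite (types are simplified stand-ins for the real ones; key generation, the expired-key cleanup pass, and the database writes are elided):

type Timestamp = u64; // seconds since the epoch, for illustration

#[derive(Clone, Copy, PartialEq)]
struct HpkeCiphersuite(u16, u16, u16); // (kem, kdf, aead) identifiers, simplified

#[derive(PartialEq)]
enum State { Pending, Active, Expired }

struct Key { id: u8, suite: HpkeCiphersuite, state: State, state_changed_at: Timestamp }

enum Op { InsertActive(HpkeCiphersuite), InsertPending(HpkeCiphersuite), Promote(u8), Expire(u8) }

/// Decide what to do for one ciphersuite, given the keys read in the same transaction.
fn plan(suite: HpkeCiphersuite, keys: &[Key], now: Timestamp, opts: &HpkeOptions) -> Vec<Op> {
    let active = keys.iter().find(|k| k.suite == suite && k.state == State::Active);
    let pending = keys.iter().find(|k| k.suite == suite && k.state == State::Pending);
    match (active, pending) {
        // Bootstrapping: no active key for this ciphersuite yet.
        (None, _) => vec![Op::InsertActive(suite)],
        // The active key is still young enough; nothing to do.
        (Some(active), _) if now - active.state_changed_at < opts.active_to_expired_s => vec![],
        // The active key is too old and a pending replacement has been visible to
        // aggregator caches for long enough; promote it and expire the old key.
        (Some(active), Some(pending))
            if now - pending.state_changed_at >= opts.pending_to_active_s =>
        {
            vec![Op::Promote(pending.id), Op::Expire(active.id)]
        }
        // A pending replacement exists but hasn't been pending long enough; wait.
        (Some(_), Some(_)) => vec![],
        // The active key is too old and there is no pending replacement; create one.
        (Some(_), None) => vec![Op::InsertPending(suite)],
    }
}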

The HPKE config ID is chosen at random from the IDs not already present in the table. It's possible to paint ourselves into a corner, e.g. with too many ciphersuites or too long an expiration period relative to the HPKE keypair age, so it's important to choose parameters that avoid running out of u8 space.
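
For example, ID selection can be rejection sampling over the IDs read in the same transaction (a sketch, assuming u8 config IDs):

use std::collections::HashSet;

/// Pick an HPKE config ID that doesn't collide with any ID already in global_hpke_keys.
/// Returns None if all 256 IDs are taken, which sensible parameters should make unreachable.
fn choose_hpke_config_id(in_use: &HashSet<u8>) -> Option<u8> {
    if in_use.len() == 256 {
        return None;
    }
    loop {
        let candidate: u8 = rand::random();
        if !in_use.contains(&candidate) {
            return Some(candidate);
        }
    }
}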

The age of a key and the times of its state changes are tracked by the existing created_at and updated_at columns on global_hpke_keys.

Note that we take an ExclusiveLock on the table so that only one key rotator replica can read and write the table at a time. This shouldn't block any cache updates from aggregator replicas. We need the lock because there is no uniqueness constraint on the HPKE ciphersuite; without it, multiple concurrent key rotators could each insert a pending key for the same ciphersuite.
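
In PostgreSQL terms the transaction would start roughly like this (a sketch using tokio_postgres directly; the real datastore layer would wrap it differently). EXCLUSIVE mode conflicts with writes and with other EXCLUSIVE holders, i.e. other key rotator replicas, but not with the plain SELECTs issued by aggregator cache refreshes:

async fn run_sweep(client: &mut tokio_postgres::Client) -> Result<(), tokio_postgres::Error> {
    let tx = client.transaction().await?;
    // Serializes key rotator replicas against each other without blocking readers.
    tx.execute("LOCK TABLE global_hpke_keys IN EXCLUSIVE MODE", &[]).await?;
    // ... read current keys, apply the per-ciphersuite logic above, write changes ...
    tx.commit().await?;
    Ok(())
}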

divergentdave commented 1 month ago

We should either list the global keys again just before finishing the database transaction and check that only our new key was added, or run the transaction at the SERIALIZABLE isolation level, to prevent two processes from creating keys at the same time. Never mind, locking the table should already prevent that.