Looks like a good start. Initial thoughts:
The attester interface seems generally useful. Could land it separately to reduce the size of this PR.
In addition to reporting update failures, the leader should probably drop workers from its list if it can't update them, to handle instances that have failed. We probably also want some guards against stale nodes, especially since key rotation may happen no more often than every few weeks. Maybe workers should re-register periodically, as a keep-alive that prevents expiry from the leader's list. Likewise, workers should periodically check their key material against the leader and terminate if they haven't received an update. Maybe both could be combined into a single keep-alive ping?
Quick summary of where we are with the PR:
What remains to be done:
Also, the scripts/ directory contains a few shell scripts that help with testing key synchronization locally.
Are there advantages to having separate registration and heartbeat endpoints? If the initial registration request contained an empty body (or a hash of null key material), the leader could use the same logic to schedule a key exchange. When workers send subsequent registration requests with a current key hash, that could work the same as a heartbeat, updating the leader's list of workers without triggering an immediate keysync.
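The unified endpoint described above could be as simple as comparing the submitted key hash against the leader's current one; an empty or stale hash triggers a keysync, while a current hash acts as a plain heartbeat. A minimal sketch, with all names (`leader`, `register`, `currentKeyHash`) hypothetical:

```go
package main

// leader holds the hash of its current key material plus the set of
// enrolled workers. Field names are illustrative only; keep-alive
// timestamps are omitted for brevity.
type leader struct {
	currentKeyHash string
	workers        map[string]bool
}

// register implements a single registration/heartbeat endpoint: it always
// (re-)enrolls the worker, and returns true iff the leader should schedule
// a key exchange. An empty body hashes to "" here, so a fresh worker and a
// worker with stale keys take the same code path, while a request carrying
// the current hash behaves exactly like a heartbeat.
func (l *leader) register(addr, keyHash string) (syncNeeded bool) {
	l.workers[addr] = true
	return keyHash != l.currentKeyHash
}
```

One nicety of this design is that the leader needs no per-endpoint state machine: the worker's report alone determines whether a keysync is scheduled.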
That's a good observation. Fixed in 697f033.
@kdenhartog IIUC, all your remarks should now be addressed. Let me know if there's anything else that requires clarification!
Removing the "work-in-progress" because it no longer is. (cc @rillian, @DJAndries)
Let's merge and address future issues in subsequent PRs.
Resolves #10