brave / nitriding-daemon

Tool kit for building secure, scalable, and networked services on top of AWS Nitro Enclaves.

State propagation within the enclave #34

Open rillian opened 1 year ago

rillian commented 1 year ago

Keysync race

Race conditions with key propagation mean it's important that the inner application adopt new state whenever it's received through the keysync procedure.

We've written state transfer as a pull-oriented API on the internal webserver within the enclave. This is better security practice than pushing to a web endpoint, but it limits what nitriding can do to notify the inner application.

I think a race condition implies the inner application needs to continuously poll the /enclave/state endpoint to resolve glitches in state propagation. Below is a step-by-step example to illustrate the issue. If --appcmd is passed to have nitriding launch the inner application, the simplest approach is to restart the application process after every keysync. It can then load the new state just as it does on enclave startup.

In other cases, the inner application must take responsibility for polling for state updates frequently. This is also the only possible approach if the application needs to keep local state distinct from what it receives from the leader. The nitriding daemon could support long polls, websockets, or other push-over-pull schemes to improve the latency of poll-based state propagation.
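To make the continuous-poll option concrete, here's a minimal sketch of what the inner application could do. The address, response handling, and install callback are assumptions made for illustration, not nitriding's actual API surface; it only assumes that a GET on the internal /enclave/state endpoint returns the current key blob.

```go
// Hypothetical continuous-poll loop for the inner application: adopt
// whatever state nitriding currently holds, whether or not we asked for it.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

const stateURL = "http://127.0.0.1:8080/enclave/state" // placeholder address

func pollState(install func(key []byte)) {
	var current []byte
	for {
		resp, err := http.Get(stateURL)
		if err != nil {
			log.Printf("state poll failed: %v", err)
		} else {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && resp.StatusCode == http.StatusOK && !bytes.Equal(body, current) {
				current = body
				install(current) // adopt the new state as soon as it appears
			}
		}
		time.Sleep(10 * time.Second) // poll interval is a tuning knob
	}
}

func main() {
	pollState(func(key []byte) {
		log.Printf("adopting new key material (%d bytes)", len(key))
	})
}
```

The important property here is that the application adopts whatever state it sees, rather than deciding for itself whether an update is needed.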

Details

What follows is a detailed example to illustrate the issue with update convergence.

The leader and worker enclaves must run the same software image so they can attest each other. That means each must start identically and then behave differently based on its environment. For the inner application, leader vs worker status is reflected in the /enclave/state endpoint.

A common start-up procedure would be to poll that endpoint in a loop.
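As a sketch, such a loop might look like the following, where the empty-body-means-leader convention, the endpoint address, and the commented-out helpers are assumptions made purely for illustration:

```go
// Hypothetical start-up loop: poll /enclave/state until nitriding answers,
// then branch on whether key material has already been synced to us.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

const stateURL = "http://127.0.0.1:8080/enclave/state" // placeholder address

func main() {
	for {
		resp, err := http.Get(stateURL)
		if err != nil {
			// nitriding may still be setting up; retry until it answers.
			time.Sleep(time.Second)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		if len(body) == 0 {
			// Assumed convention for this sketch: no synced state means we
			// are the leader (or an isolated enclave) and must generate our
			// own OPRF key, then hand it to nitriding.
			log.Println("no synced key found; acting as leader")
			// generateAndUploadKey() // hypothetical helper
		} else {
			// Synced state present: we are a worker; adopt the leader's key.
			log.Printf("acting as worker; adopting %d bytes of key material", len(body))
			// installKey(body) // hypothetical helper
		}
		return
	}
}
```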

I'll consider the star-randsrv application for the rest of this example.

So, after startup, the nitriding daemons work out whether they're the leader, a worker, or an isolated enclave. The leader's randomness server instance generates an OPRF key, and worker instances initialize themselves with a copy of the leader's key.

Each key is valid for a fixed number of measurement epochs. Once those are exhausted, the randomness server cannot answer queries until it has a new key. The leader still knows it's the leader, so it can generate a new key and upload it to nitriding for propagation. Workers can begin polling the nitriding daemon to receive the update.
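The leader's side of that rotation could look roughly like the sketch below; the PUT verb, the endpoint address, and the newOPRFKey stand-in are illustrative assumptions rather than a description of nitriding's or star-randsrv's real interfaces.

```go
// Hypothetical key-rotation step on the leader: generate a fresh key and
// hand it to nitriding so it can be propagated to the workers.
package main

import (
	"bytes"
	"crypto/rand"
	"log"
	"net/http"
)

const stateURL = "http://127.0.0.1:8080/enclave/state" // placeholder address

// newOPRFKey stands in for star-randsrv's real key generation.
func newOPRFKey() []byte {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		log.Fatalf("key generation failed: %v", err)
	}
	return key
}

func rotateKey() {
	key := newOPRFKey()
	req, err := http.NewRequest(http.MethodPut, stateURL, bytes.NewReader(key))
	if err != nil {
		log.Fatalf("building request: %v", err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalf("uploading state to nitriding: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status from nitriding: %s", resp.Status)
	}
	log.Println("new key handed to nitriding for propagation")
}

func main() { rotateKey() }
```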

This key rotation happens at the same point in time for all instances. When they start polling, worker instances are likely to receive the old key again. So they must again poll in a loop until they receive a new key, pushed over the network from the leader enclave. Eventually all enclaves will have an updated key and can start answering queries again.
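The "poll until the key differs" step might be implemented along these lines, again with an assumed endpoint and placeholder key handling. Note that the loop decides purely by comparing against the key currently in use, which is exactly the assumption that breaks down in the failure scenario below.

```go
// Hypothetical helper used at epoch exhaustion: keep fetching the synced
// state until it differs from the key we have been using, then return it.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

const stateURL = "http://127.0.0.1:8080/enclave/state" // placeholder address

func waitForDifferentKey(current []byte) []byte {
	for {
		resp, err := http.Get(stateURL)
		if err != nil {
			log.Printf("state poll failed: %v", err)
		} else {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && len(body) > 0 && !bytes.Equal(body, current) {
				return body // the leader has pushed a replacement key
			}
		}
		time.Sleep(time.Second)
	}
}

func main() {
	oldKey := []byte("exhausted-key-placeholder")
	newKey := waitForDifferentKey(oldKey)
	log.Printf("received replacement key (%d bytes)", len(newKey))
}
```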

Now consider what happens if the leader goes down between startup and epoch exhaustion. Ideally it's long-lived, but hardware failures do happen. Anyway, it goes offline. Workers can observe this through their heartbeat requests timing out, so one option is for them to terminate, forcing a restart of the whole cluster. However, the workers actually have all the data they need to handle requests for the moment: key rotation might be months away, and they don't actually need the leader until then. So maybe it's better to let them continue on, and just restart the leader, the same as a worker would be restarted if it failed. Once the new leader is up, it can receive the worker heartbeats and be ready to handle propagation.

However, since it must follow the same startup sequence as before, the new leader will generate a new key and push it to the worker enclaves, even though their current keys are still valid. If we're following the least-disruptive pattern of having the randomness server poll nitriding only when it needs new key material, at first this seems fine: nitriding on each worker now holds a new key, but the inner randomness server ignores it. Then, at epoch exhaustion, the workers fetch a new key as before.

Before, while we had a race between the leader distributing the new key and the workers adopting it, it was resolved by the workers polling until they saw a key that was different from the one they had. But since the leader has been restarted, there are actually three keys in the system: the old key the randomness servers have been using, the new key propagated to the workers when the leader restarted, and the new-new key the leader's randomness server just generated. Different workers could end up initialized from different ones, depending on the ordering of the responses, partitioning the cluster.

Therefore the idea of letting workers continue after the leader fails doesn't really work: there is no way to ignore updates that are unnecessary disruptions but accept those that are important to keeping the cluster consistent.

Workers shutting down when they can't contact the leader is simple, but expensive in terms of downtime. More likely we'll prefer to have the nitriding daemon stay up but restart the inner application process, or have the inner application continuously poll for state updates and adopt them as soon as they're available. That way the cluster always moves toward consistency.
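For the restart option, the daemon side could be as simple as the sketch below. The runApp wiring and the keysync notification channel are hypothetical; they're only meant to show the shape of the proposal, not how nitriding's --appcmd handling currently works.

```go
// Hypothetical sketch of the restart-after-keysync option: nitriding keeps
// the --appcmd process running and restarts it whenever new state arrives,
// so the application reloads its key exactly as it would on enclave startup.
package main

import (
	"log"
	"os/exec"
)

// runApp launches appCmd and restarts it each time a value arrives on
// keysync (a channel the keysync handler would signal in this sketch).
func runApp(appCmd string, args []string, keysync <-chan struct{}) {
	for {
		cmd := exec.Command(appCmd, args...)
		if err := cmd.Start(); err != nil {
			log.Fatalf("starting %s: %v", appCmd, err)
		}

		// Wait for the next keysync, then stop and relaunch the process.
		<-keysync
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
		log.Println("keysync received; restarting inner application")
	}
}

func main() {
	keysync := make(chan struct{})
	go runApp("/path/to/inner-app", nil, keysync) // placeholder command
	// In the real daemon the keysync handler would signal this channel;
	// here we just block forever.
	select {}
}
```

Restarting the process is crude, but it guarantees the application reloads state through the exact same path it uses at enclave startup, so there's no separate adoption logic to get wrong.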