Open mkmik opened 4 years ago
CC: @jjo
@mkmik lifecycle LGTM, it would be also great to have the keys annotated with timestamps for its relevant events:
t0
: key creationt1
: key promotion to "current"t2
: key demotion
, so that cluster operators could then potentially verify that backups did happen between [t0, t1]
, to be able to assert no "uncovered" backup period happened.
The suggested backup workflow works a-ok in practice but suffers from a race condition. It also doesn't resonate with a lot of people (possibly because they intuit such a race condition or possibly just because or doc is bad).
The race condition is the following:
kubeseal
(or fetches a cert withkubeseal --fetch-cert
)If there is not enough time to run a backup between step 2 and 4, there might be one or more sealed secrets encrypted with a key that hasn't been backed up yet (and thus cannot be recovered after the disaster).
That's not usually a big problem if you run frequent backups (daily) or if you synchronize your backup schedule with the key renewal schedule, but the race condition is still there, albeit for a short amount of time.
Sealed secrets works with many keys, and while the youngest key is a bit special because it's the one used by default when sealing a new secret with
kubeseal
, you can seal new secrets by using any of the existing keys as well; that can happen since people can use the "offline" method and pass the certificate explicitly with--cert
.The latest (most recent) key is special because it's the most secure (the less likely to have been leaked in the past; simply because it's younger).
Our current default workflow does indeed prioritise security over disaster recoverability.
The subset of users who care about disaster recoverability are quite vocal, so I think we should do something to help them.
A common ask is control over sealing key renewal, but what I really think should happen is control over advertising the public key of the newly created key as "current" (and thus causing the possibility of encrypting an unrecoverable a secret until a backup is done).
A possible "almost zero knob" UX for this would be to add a delay before a new public key is advertised as current.
For example, imagine adding a default delay of renewal period / 2.
Let's draw a period of 10 days and a delay of 5 days:
now, this means that as long as you run a periodic backup every 5 days, no matter what's the phase of your backup w.r.t the key renewal, every time a key becomes "current", you're guaranteed to have backed it u:
i.e. at any vertical slice of that diagram, any of the possible backup schedule "phases" already contains the current key number: