How/when to do proper backups?

real-or-random commented 2 weeks ago

Conditional agreement is nice, but it doesn't prevent the following bad trace of events:

1. ChillDKG runs normally, but only participant A receives a success certificate. All other participants receive nothing or invalid signatures.
2. Participant A "uses" the threshold public key by sending money to it.
3. Participant A sinks in the ocean together with the backup of the recovery data.

Outcome: No one can sign, the money is lost.

(Event 3 means someone wasn't super careful with the backup, but I think it should be a feature of threshold signatures that any t participants should be able to sign, even if some other participant loses their backup.)

This problem doesn't affect all settings: If using the threshold public key involves interaction with all the participants, e.g., in a threshold wallet owned by a single user who looks at all the devices to verify an address before receiving on it, success certificates could be presented and backups made before the address is confirmed by all devices.

But it's certainly a problem in some settings, e.g., mutually distrustful remote participants, where one participant uses public derivation to create an address and send to it.

One could think that it suffices to add another round of signatures, but if that's done naively, then this just postpones the problem one round. We'll need to do more.

The only solution I can think of is this:

- Add another round of signatures.
- Have a "partly successful" state and a "fully successful" state.
- A participant is partly successful before sending out their second signature, i.e., after receiving the (current) success certificate. And it's fully successful after receiving all second signatures (call this the second certificate).
- As currently, only fully successful participants can "use" the threshold public key. (Think of "fully successful" as returning the threshold public key.)
- Partly successful participants were able to create and back up the recovery data. They are happy to participate in signing sessions. (Think of "partly successful" as returning the secshare.)

Then the trace will look as follows:

1. ChillDKG runs normally, but only participant A receives the second certificate. All other participants receive nothing or invalid second signatures, but they have created backups of the recovery data before sending out their second signature.
2. Participant A "uses" the threshold public key by sending money to it.
3. Participant A sinks in the ocean together with the backup of the recovery data.
4. "New" participant A can retrieve the recovery data from any participant, and recover from it and the seed.
5. We have at least n participants who can sign and spend the money. (But this DKG session is still degraded because none of the n participants can use the threshold public key, so a new session will be necessary...)

This gives you the following property: If you are fully successful, then every other honest participant has a backup of the recovery data. (And if you have the recovery data, you can at least restore everyone to the partly successful state, which suffices to create signatures.)
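To make the two states concrete, here is a minimal sketch of the proposed flow as a toy state machine. Everything in it is hypothetical and only for illustration: the names `State`, `Participant`, `on_first_certificate`, `on_second_signatures` and `run_trace` are made up, signatures are placeholder strings, and all participants are assumed to be honest (no byzantine behavior is modeled). It is not the ChillDKG API.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()
    PARTLY_SUCCESSFUL = auto()  # recovery data backed up; may join signing sessions
    FULLY_SUCCESSFUL = auto()   # second certificate received; may use the threshold public key


@dataclass
class Participant:
    name: str
    state: State = State.RUNNING
    has_backup: bool = False

    def on_first_certificate(self) -> str:
        # Back up the recovery data BEFORE releasing the second signature.
        self.has_backup = True
        self.state = State.PARTLY_SUCCESSFUL
        return f"sig2:{self.name}"  # placeholder string, not a real signature

    def on_second_signatures(self, sigs: set, all_names: list) -> None:
        # The "second certificate" is modeled as the full set of second signatures.
        expected = {f"sig2:{n}" for n in all_names}
        if self.state is State.PARTLY_SUCCESSFUL and sigs == expected:
            self.state = State.FULLY_SUCCESSFUL

    def can_sign(self) -> bool:
        return self.state in (State.PARTLY_SUCCESSFUL, State.FULLY_SUCCESSFUL)

    def can_use_threshold_pubkey(self) -> bool:
        return self.state is State.FULLY_SUCCESSFUL


def run_trace() -> dict:
    names = ["A", "B", "C"]
    parts = {n: Participant(n) for n in names}
    # Everyone receives the first certificate, backs up, and sends a second signature.
    sigs = {p.on_first_certificate() for p in parts.values()}
    # Only A receives all second signatures; B and C receive nothing.
    parts["A"].on_second_signatures(sigs, names)
    return parts


parts = run_trace()
# The property: if anyone is fully successful, every participant holds a backup.
if any(p.can_use_threshold_pubkey() for p in parts.values()):
    assert all(p.has_backup for p in parts.values())
```

In this toy run, A ends up fully successful while B and C stay partly successful, yet all three hold a backup and can sign, which is the situation in the trace above.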

Sigh. Adding a round and having two states adds complexity that I'd love to avoid, but I currently don't see another solution. And I think it's a real problem; just ignoring it doesn't seem to be a good idea either. For a moment, I thought we could try to keep the additional round and the description of this out of the "official" protocol and explain it only informally. But that's a bit lazy on our side, and I think it makes things just harder for the user and increases the footgun potential.

For completeness, here's a worse variant of this without adding a round or a second certificate:

- A participant is partly successful before sending out their signature. And it's fully successful after receiving all signatures, i.e., the success certificate.
- Partly successful participants were only able to create a local backup of their shares. But they may not even have the first part of the recovery data without the certificate because the shares they have received for other participants may be garbage. They are happy to participate in signing sessions, which should not affect unforgeability. [^1]

Then you can at least restore to the partly successful state from the local backups.

This means that

- We'll need two backup steps, one for the local "partial" backup (which is different for every participant!) and one for the full recovery data. Sounds very annoying if you, e.g., print a sheet of paper as part of each backup...
- If more than n-t participants sink in the ocean, you can't restore at all...

[^1]: One needs to incorporate this distinction between fully/partly successful in the unforgeability definition, to make sure that the adversary gets a working signing oracle for some participant even if that participant is only partly successful (yet). That's not a fundamental problem, it's just a bit involved. (Note to myself: The reduction in CGRS23 currently extracts the pops from an arbitrary participant; this needs to be changed to the first participant who succeeds partly and thus has received valid pops. This is to ensure that the pops are available before the first signing query, which is crucial for the early aborting strategy in algorithm D*.)

real-or-random commented 1 week ago

Not convinced that this helps, but here's an analogy to atomic commit protocols:

Our current protocol is a bit like two-phase commit (where the broadcast by the coordinator is the proposal and the signatures from the participants are the ACKs). Then, adding a round makes it a bit like three-phase commit, and the motivation for three-phase commit is precisely to avoid a situation where one participant has committed but no one else will know about this (see https://en.wikipedia.org/wiki/Three-phase_commit_protocol#Motivation).

The assumptions are rather different from what we have (no byzantine failures, every participant has a local write-ahead log = local backup), and so is the way the problematic situation in two-phase commit is resolved (everyone is honest and will eventually respond, so the protocol is not dead forever but just blocks a bit). But I think the core of the problem is remarkably similar.

real-or-random commented 1 week ago

Okay, it's perhaps less dramatic than I thought. Another way to look at the current protocol is that it works with this usage convention:

"If you use the threshold public key, you are responsible for having a backup of the recovery data." (So you can at least convince everyone else to sign).

That sounds pragmatic to me. And then this issue is rather about different backup strategies: If your backup strategy is that all participants (or some subset of them) have a copy of the recovery data, then you should wait for a second round of acknowledgements before using the threshold public key. In a single-user-in-room setting, this can be as simple as receiving data only if all devices indicate "OK". (And if you have some other backup strategy, then just go ahead with that strategy before using the threshold public key.)

Once you recover from the seed and the recovery data and nothing else, you again don't know whether your backups are sound, e.g., whether all participants (or some subset of them) have a copy of the recovery data. In that case, you'll need to redo/verify the backup strategy, e.g., re-ask the other participants for acknowledgement. This step may not succeed, but this is still "safe": While you are degraded in the sense that you may not be able to send more money to the threshold public key, you are fully able to join signing sessions, so you can help spend any money already stored under the threshold public key. (This is exactly the same as when your initial backup step didn't work out, e.g., you haven't received enough acknowledgements.)
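A rough sketch of this convention, assuming a backup strategy in which a chosen set of participants must acknowledge that they hold the recovery data. The names `LocalView`, `record_ack`, `may_sign`, `may_use_threshold_pubkey` and `recover` are hypothetical and not part of any existing implementation; acknowledgements are modeled as plain participant names rather than signed messages.

```python
from dataclasses import dataclass, field


@dataclass
class LocalView:
    has_recovery_data: bool
    backup_holders: frozenset  # who is supposed to hold a copy (strategy-specific assumption)
    acks: set = field(default_factory=set)

    def record_ack(self, name: str) -> None:
        # Only acknowledgements from expected backup holders count.
        if name in self.backup_holders:
            self.acks.add(name)

    def may_sign(self) -> bool:
        # Joining signing sessions needs only our own recovery data.
        return self.has_recovery_data

    def may_use_threshold_pubkey(self) -> bool:
        # Receiving on the threshold public key waits for all acknowledgements.
        return self.has_recovery_data and self.acks >= self.backup_holders


def recover(backup_holders) -> LocalView:
    # After recovering from seed and recovery data, earlier acknowledgements
    # are unknown, so the backup check starts over with an empty set.
    return LocalView(True, frozenset(backup_holders))


view = recover({"B", "C"})
view.record_ack("B")
print(view.may_sign(), view.may_use_threshold_pubkey())
view.record_ack("C")
print(view.may_use_threshold_pubkey())
```

The point of the split is that a missing acknowledgement only blocks receiving on the key; it never blocks signing.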

I think this is something that can be explained in the backup section in the BIP text. If we'd like to, we can also provide an implementation of the acknowledgements, but it's not crucial. Does that sound reasonable?

jonasnick commented 4 days ago

> I think this is something that can be explained in the backup section in the BIP text. If we'd like to, we can also provide an implementation of the acknowledgements, but it's not crucial. Does that sound reasonable?

Yes. I agree that the sequence of "bad" events you came up with can be relevant in practice and that the instruction "If you use the threshold public key, you are responsible for having a backup of the recovery data." is a reasonable way to deal with this - whatever form this may take in a specific scenario.

As an alternative to the third-round ACK, the signers could also attempt to produce a FROST signature, which adds another round (if we don't pipeline the nonce exchange round) but may rule out other subtle failure modes as well.

real-or-random commented 4 days ago

> As an alternative to the third-round ACK, the signers could also attempt to produce a FROST signature, which adds another round (if we don't pipeline the nonce exchange round) but may rule out other subtle failure modes as well.

I had considered this. As you say, it might rule out other subtle failure modes, but it's also a bit weaker in the sense that it acts as an ACK, but only from t signers and not all n.
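To illustrate the difference in coverage, here is a tiny sketch with made-up values (n = 5, t = 3). The function `unconfirmed` and the participant names are hypothetical; the sketch only counts who has implicitly acknowledged and performs no actual signing.

```python
def unconfirmed(all_participants: set, confirmers: set) -> set:
    """Participants whose backup state has not been confirmed by this check."""
    return all_participants - confirmers


participants = {"A", "B", "C", "D", "E"}  # n = 5
frost_signers = {"A", "B", "C"}           # t = 3 signers produce the test signature

# A test signature only confirms the t signers who took part in it.
print(sorted(unconfirmed(participants, frost_signers)))
# An explicit ACK round waits for all n participants.
print(sorted(unconfirmed(participants, participants)))
```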
