Identifiable Aborts - Githubissues

BlockstreamResearch / bip-frost-dkg

15 stars 7 forks source link

Identifiable Aborts #9

Open real-or-random opened 8 months ago

real-or-random commented 8 months ago

We forego robustness deliberately because it's hard/impossible in our restricted setting without a reliable broadcast channel or honest majority.

But we could aim for the weaker goal of identifiable aborts (IA), i.e., it's possible to tell who misbehaved to disrupt the protocol. Let's call the simple omission of messages "passive" disruption, and the contribution of wrong messages "active" disruption. Both forms constitute problems for us due to the lack of consensus and the possibility of equivocation. In the case of passive disruption, we can't even agree whether some participant omitted a message. In the case of active disruption, we could potentially have participants sign all their protocol messages so that victims of active disruption can generate "fraud proofs", but then due to the lack of broadcast, we can't ensure that everyone sees all valid fraud proofs, so we can't reach consensus on the set of participants to blame.

But what we can achieve in our restricted setting is a weaker form IA that assumes that the coordinator is honest and has reliable point-to-point connections to the participants. If the coordinator is honest, we make it authoritative for determining omissions. In the case of active disruption (=decrypted share is inconsistent with VSS commitments), a participant can reveal its decryption key to the coordinator, who can use it to identify the participant that has sent inconsistent protocol messages. (In variants where the coordinator aggregates VSS commitments or encrypted shares, this needs to be done by the coordinator who has the unaggregated messages.) This won't need additional signatures.

Revealing the decryption key, of course, means that a fresh one needs to be generated in a future retry. That is, if we want IA, we can't be fully deterministic.

real-or-random commented 7 months ago

From a chat with @jonasnick :

This doesn't work, at least not in the current RecPedPop with the certifying equality check.

Claim: Participants need to ensure that they either sign the transcript or reveal the decryption key (but not both). Proof: If a participant signs the transcript, some other honest participant may terminate, and in that case no decryption key should be revealed.

Now what can happen in the certifying equality check is that honest participant A signs the transcript, and then participant B accuses A of having sent wrong shares. But honest A can't reveal decryption keys at this point because this would violate the above claim.

Possible ways around this problem include:

If the certifying equality check protocol is not used, e.g., when all devices are in a single room and the user just looks at all of them, the problem disappears. For example, the user could just not confirm the equality check if there's an accusation.
If the certifying equality check protocol is used, e.g., in RecPedPop, we could try to separate the equality check from the accusation into distinct rounds accuse, reveal, and equality check. This ensures that the equality check protocol won't start -- and thus no honest participant will send out a signature on the transcript -- before it's clear that there's no accusation.

These approaches could work as long as the coordinator is reliable. But if one wants to exclude the faulty participant responsible for the abort, and restart the protocol, then we'll need some kind of retry counter, which is also part of the transcript (e.g., in the context string), and further issues arise:

The coordinator will always be trusted not to exclude honest participants illegitimately. (When excluding a participant, it may be a good idea to decrement just n and not t, to make sure that a malicious coordinator can't lower t until enough malicious participants can sign.)
If the coordinator is malicious, then multiple DKG runs may terminate: The coordinator can claim not to have received a transcript signature from some participant and thereby convince everyone to increment the retry counter and move on to the next DKG run. But the coordinator may still construct a valid certificate for the first run. Depending on the application, this may or may not be a problem. For example, when the application is just to receive coins, it's not a big deal to receive on multiple addresses. But even if it's not a showstopper in a particular application, this adds complexity.
- In applications where multiple DKG runs are acceptable, one could actually go a step further and simply run f+1 DKGs runs in parallel with f+1 distinct participants acting as signer (where f is some upper reasonable bound on the number of malicious nodes, e.g., f=t). This ensures that there's at least one honest coordinator. (This is very similar to Section 4.4 of the ROAST paper.)

real-or-random commented 4 months ago

I don't think we'll include this.

real-or-random commented 3 weeks ago

Now what can happen in the certifying equality check is that honest participant A signs the transcript, and then participant B accuses A of having sent wrong shares. But honest A can't reveal decryption keys at this point because this would violate the above claim.

The simple solution is that B reveals the decryption key to the coordinator, who can decrypt all shares encrypted to B and check which of them are incorrect. No honest participant will terminate successfully due to the lack of a signature from B. I'm not sure why we hadn't considered this...

But it's still true that this approach doesn't work with a deterministic protocol (and also certainly not with a protocol in which we reuse the decryption key as a long-term signing key). Adding a counter is dangerous because there's always the risk that an old counter value (from a failed session with revealed keys) is reused.

Here's a sketch of some protocol modifications:

We'll need to add code paths for accusing and blaming to the signer and coordinator code, respectively.
- We'll need to exchange ephemeral encryption keys instead of long-term keys. We could just make the hostpubkeys ephemeral.
Additionally, we'll need to store the decryption keys in the transcript (wrapped in symmetric encryption). Or simpler, let every party add a random salt to the transcript that is added to the derivation of the decryption key from the seed. (Or, abuse the signature nonce from the cert for this purpose. This is probably too much of an optimization. ;))

That approach isn't crazy. But still, the only thing we'd gain is a weak form of IA in which the coordinator is trusted to identify the faulty participant. I'm not convinced that this is worth making modifications. But I wanted to bring this up, after I had the above insight that weak (IA) is probably simpler than we previously thought.

@llfourn What's your thinking/approach to handle errors during the DKG in Frostsnap?

By the way, an entirely different approach to get IA is PVSS (using verifiable encryption), but that's not simpler either.

real-or-random commented 2 weeks ago

Taking a step back, I think here's a better approach: If the problem is that the coordinator aggregates shares, then we shouldn't work around this, but simply let the coordinator not aggregate shares in all cases.

The coordinator sends the entire aggregated shares_sum vector for all n participants, as currently. Then all participants can still build a recovery data as usual.
Additionally, the coordinator sends to participant i all n shares for i. This needs more communication, but it's still O(n) per participant. A minor drawback is that the coordinator now constructs "individual" protocol messages for the n participants instead of sending the same message to everyone.
Now, any honest participant who receives wrong shares can simply blame the malicious sender.[^1] This means that some problems (omissions) are blamed by the coordinator, and some (wrong shares) are blamed by the receivers. But there's nothing wrong with this.

While this needs more communication, it's much simpler than what I sketched above, and it's a good enough solution. My current thinking is that this is worth the hassle for achieving IA, and we should include this.

IA should only be necessary if a device is malicious or broken, but it's really kind of bad to tell the user: Oh, something is wrong with one of these then devices, but it's impossible to tell you more. Just get five new devices and try again. The ability to pinpoint this to two devices (the blaming device could also be the culprit) makes this much easier, and it will also help with debugging.

[^1]: If a malicious coordinator changed the share, it could make one honest participant blame another honest participant. This is bad, but this can simply be avoided by adding a MAC to the individual ciphertexts.

LLFourn commented 1 week ago

@LLFourn What's your thinking/approach to handle errors during the DKG in Frostsnap?

For a single-user keygen we're definitely not going for IA. The reason is that it requires PKI to be set up and verified by each other participant to be useful. Saying participant 7 is malicious is only useful if you know what participant 7 is in physical space. You must have established the identity of it already and mapped it to a physical device. How do you establish this? Probably through the UI of the coordinator device -- but you are trying to design something that is secure (correct IA in this case) with malicious coordinator which means it will probably just set this up initially by presenting the keys of participant 7 wrongly to all the devices ahead of time. We can ask the user to check with all the other devices that it has registered the correct key for participant 7 but this seems like more work. You could do this check in batch but this is not much different than just checking the views of all devices match at the end.

We will eventually have to approach multi-user key generation and also key generation with non-frostsnap devices. Here it would be helpful to know who to blame. I can't come up with anything better than what you've suggested for the "blame by decrypting" approach. I would be very tempted to say "just use verifiable encryption" for the shares of some kind instead. If the underlying verifiable encryption is homomorphic then you can maintain the aggregating -- you know that if you get the wrong share it's always the coordinator's fault since it must have checked the proof before aggregating them. This means it's just a drop in replacement for the original protocol that doesn't add extra messages, it just increases the size (significantly) of the encrypted shares the participants send to the coordinator in the first step.

The coordinator sends the entire aggregated shares_sum vector for all n participants, as currently. Then all participants can still build a recovery data as usual.

Additionally, the coordinator sends to participant i all n shares for i. This needs more communication, but it's still O(n) per participant. A minor drawback is that the coordinator now constructs "individual" protocol messages for the n participants instead of sending the same message to everyone.

Just to check my understanding the first point is to just keep the recovery data as small as possible.

real-or-random commented 1 week ago

Thanks for chiming in!

For a single-user keygen we're definitely not going for IA. The reason is that it requires PKI to be set up and verified by each other participant to be useful. [...] We can ask the user to check with all the other devices that it has registered the correct key for participant 7 but this seems like more work. You could do this check in batch but this is not much different than just checking the views of all devices match at the end.

I understand that you need PKI for IA. But don't you need some form of out-of-band check anyway (e.g., the user compares all the screens of the devices once) to make sure that the coordinator is not doing a MitM attack against everyone? Our design is based on the idea that you'll need a user step anyway, and so we can also use this step to establish a PKI: every device computes a hash of all public keys, and we assume that it has been checked out of band that all the hashes match.

Now that I'm writing this, I see that this could check could also be done later, e.g., when you create an address for the first time and anyway want to look at all the devices.

By the way, it's interesting that you mention single-user vs multi-user explicitly. I had this distinction in the back of my mind somehow, but I never thought about how it would translate to different formal security models. Is it just this: In the multi-user setting, every user trusts their device. In the single-user setting, the user trusts some majority of the devices, but they don't know which ones.

but you are trying to design something that is secure (correct IA in this case) with malicious coordinator

Hm, not really. A malicious coordinator can always claim that some participant has sent garbage or nothing at all. (Blaming endpoints only makes sense if you know that the network is reliable, but in our case the coordinator is the network.)

My thinking was only this: In "most" cases, the coordinator assigns blame, we'll need to explain somehow that the coordinator can never be fully trusted for this. But if we're in this special case that the shares are wrong, and then some participant A assigns blame, we can at least design it such that the coordinator cannot make participant A wrongly believe that participant B is to blame.

I would be very tempted to say "just use verifiable encryption" for the shares of some kind instead.

Indeed, it would be very useful to have a not-too-annoying scheme/spec readily available. From the perspective of the DKG, verifiable encryption will certainly be cleaner and simpler than verifiable decryption.

Just to check my understanding the first point is to just keep the recovery data as small as possible.

Indeed!

LLFourn commented 1 week ago

I understand that you need PKI for IA. But don't you need some form of out-of-band check anyway (e.g., the user compares all the screens of the devices once) to make sure that the coordinator is not doing a MitM attack against everyone? Our design is based on the idea that you'll need a user step anyway, and so we can also use this step to establish a PKI: every device computes a hash of all public keys, and we assume that it has been checked out of band that all the hashes match.

Yeah you have to do the out-of-band check at some point with a possibly malicious coordinator. Actually talking about this with the team we came up with the idea of just explicitly setting up the PKI with our devices by providing a cert for a factory key. So a device wouldn't save a key unless e.g. 5 other devices had certified the keygen for a 3-of-5 -- without asking the user to actually eyeball that the 5 devices are the 5 they have in front of them.

Assuming the user can trust us to set up and certify these keys correctly (one per device) I think this is sound. Now if we were to be malicious and give a particular malicious device (or the coordinator itself) a bunch of certified keys the worst they could do is trick an honest device or coordinator into a creating a key with a different set of participants. But as you said you have some chance to catch this in the "address check" anyway if you check with one of the honest devices that was excluded. But this requires you check the address with every device to get the guarantee which is unrealistic.

We'd probably want a way to opt in to the "trustless" version too which would introduce the step where the user out of band checks the view of the public key setup. So I think this conversation has changed my mind that doing the PKI set up is the better way to go.

The slight bit of complexity is that we'd probably want each device's keygen cert key key to be hardware secured which means we'd have to use RSA :scream: (or ECDSA on secp256kr1 if we're lucky!). I guess we could make the long term RSA/P-256 key sign a short term secp256k1 xonly key for the session so we could fully embrace this part of the spec.

By the way, it's interesting that you mention single-user vs multi-user explicitly. I had this distinction in the back of my mind somehow, but I never thought about how it would translate to different formal security models. Is it just this: In the multi-user setting, every user trusts their device. In the single-user setting, the user trusts some majority of the devices, but they don't know which ones.

Yeah this distinction really matters in our particular product design but not generally.

I mostly think about the single-user setting where either all devices are corrupt and the coordinator is honest or the other way around. The reason is that in a supply-chain attack setting if we suppose an adversary has the ability to corrupt one device then why would they not corrupt them all. We try to get decent guarantees in the setting where all devices are corrupt but unable to exfil data back to the remote attacker.

Of course we want security guarantees in the coordinator + <t device are corrupt setting but identifiable abort doesn't seem like it's worth the effort due to how unlikely it is that only one of your devices is corrupt.

In the multi-user setting we care more about IA so a malicious user cannot sabotage a key generation and then use social engineering to try and exclude some other participant.

but you are trying to design something that is secure (correct IA in this case) with malicious coordinator

Hm, not really. A malicious coordinator can always claim that some participant has sent garbage or nothing at all. (Blaming endpoints only makes sense if you know that the network is reliable, but in our case the coordinator is the network.)

My thinking was only this: In "most" cases, the coordinator assigns blame, we'll need to explain somehow that the coordinator can never be fully trusted for this. But if we're in this special case that the shares are wrong, and then some participant A assigns blame, we can at least design it such that the coordinator cannot make participant A wrongly believe that participant B is to blame.

Yes I the point I was laboring to make was confusing.

I would be very tempted to say "just use verifiable encryption" for the shares of some kind instead.

Indeed, it would be very useful to have a not-too-annoying scheme/spec readily available. From the perspective of the DKG, verifiable encryption will certainly be cleaner and simpler than verifiable decryption.

After I wrote this I realized that many of the schemes would not lend themselves easily to a coordinator that verifies and then aggregates. Interestingly the most naive scheme would: prove 256 ElGamal encryptions are to 0 (infinity) or 1 (G) and show the inner product with the powers of 2 is the correct value. Aggregating n of them, the coordinator could be confident that each bit encryption was in the range of 0..n which will be small enough to brute force once decrypted. Of course this only starts becoming more efficient than just dropping the coordinator aggregation and eating the $O(n^2)$ communication complexity for large values of $n$ like 20.

real-or-random commented 1 week ago

we came up with the idea of just explicitly setting up the PKI with our devices by providing a cert for a factory key. So a device wouldn't save a key unless e.g. 5 other devices had certified the keygen for a 3-of-5 -- without asking the user to actually eyeball that the 5 devices are the 5 they have in front of them.

Assuming the user can trust us to set up and certify these keys correctly (one per device) I think this is sound.

Sounds interesting, but I can't follow entirely. Are you saying that the certification is done in the factory, and in a 3-of-5, for every participant A, the other 5 (=4+1 coordinator) devices check the certification of A? This sounds reasonable to me then.

My thinking about PKI has also changed a bit in the past days. I believe I had that insight earlier, but it slipped out of my mind. The somewhat obvious part is that you can't avoid MitM without some trust anchor in the real world. As a consequence, the assumption that every participant has obtained an authentic host public key (what we call the identity key) of every participant is more or less required in every payment scenario.

But what's less obvious (I think): In many scenarios, this assumption is rather implicit, and we don't talk about it so much as we do here. And perhaps, emphasizing the comparison procedure (as we currently do in the BIP draft for example) so much is the wrong way to explain it to users and engineers, and we should just relate it to familiar scenarios more. In the simplest form: If you pay to the address of the wrong person, then you paid the wrong person. But even for multisig, this is not new: Suppose you use naive multisig or MuSig2 to create a multisig wallet non-interactively from a bunch of individual public keys. Sure, if you as a participant use the keys of the wrong people, then you have a multisig with the wrong people. In other words, the setup of any multisig wallet, even if much simpler than running DKG, needs the same checking of public keys.

And existing setup procedures and user documentation gets this (mostly) right, so we could leave this part to engineers using the BIP, and we don't need to specify the exchange and comparison of public keys as an explicit step of the protocol. Having authentic keys is more like a setup assumption. It needs to be explained properly, but that's all.

LLFourn commented 1 week ago

Sounds interesting, but I can't follow entirely. Are you saying that the certification is done in the factory, and in a 3-of-5, for every participant A, the other 5 (=4+1 coordinator) devices check the certification of A? This sounds reasonable to me then.

I think so. In the factory we sign each devices' public key with our factory one (i.e. set up PKI). PKI is set up beforehand so everyone can verify the other devices' keys against our factory key without doing an out of band check with the user. The out of band check is still necessary for the paranoid who don't want to rely on us not having leaked our certification key.

And existing setup procedures and user documentation gets this (mostly) right, so we could leave this part to engineers using the BIP, and we don't need to specify the exchange and comparison of public keys as an explicit step of the protocol. Having authentic keys is more like a setup assumption. It needs to be explained properly, but that's all.

Yes this is as great point. In existing multisig theuser is constructing the descriptor manually by moving over the xpubs so this "checking" happens naturally as part of the UI workflow. In our scenario we have to prod them unexpectedly to do this check so the rest can be done automatically for them. The way they check against malicious "coordinator" is also to load the descriptor to the hardware wallet and do address checking from there.

real-or-random commented 1 week ago

My thinking was only this: In "most" cases, the coordinator assigns blame, we'll need to explain somehow that the coordinator can never be fully trusted for this. But if we're in this special case that the shares are wrong, and then some participant A assigns blame, we can at least design it such that the coordinator cannot make participant A wrongly believe that participant B is to blame.

In our current code, it's, in fact, the participant who assigns blame (except in the case of invalid shares, of course). See #29.

So with what I suggested above (https://github.com/BlockstreamResearch/bip-frost-dkg/issues/9#issuecomment-2183358410), we could also just stay in this model and always let the participants assign the blame.

Advantages:

It's simple.
We won't even need the MAC.

Disadvantages:

A malicious coordinator can make a participant A wrongly blame participant B. I think that's a little worse than a malicious coordinator wrongly blaming participant B (because then it's a bit more obvious that it's a claim by the coordinator). But on the other hand, blaming is hard to understand anyway, if A's output is properly explained as "B or the coordinator is faulty", this may be fine.

jonasnick commented 4 days ago

Quoting from above

Taking a step back, I think here's a better approach: If the problem is that the coordinator aggregates shares, then we shouldn't work around this, but simply let the coordinator not aggregate shares in all cases.

As @real-or-random and I discussed offline, this approach doesn't quite work: in order to check correctness of the share, the signer needs access to the individual VSS commitments and not the summed one. This means that the coordinator also needs to send all individual VSS commitments which increases the communication to O(t*n) per signer. That's still a possible approach, but a larger change than it seemed initially.

In the version where signers provide their decryption key to the coordinator to assign blame, we should ensure that the signers do not generate the same polynomial in subsequent DKG sessions because the coordinator has partial information about the shares. So this issue is related to #30.

real-or-random commented 4 days ago

In the version where signers provide their decryption key

Note that we have static ECDH keys, so the victim can't reveal the ECDH secret key (because it's the hostseckey).

The victim could, in theory, reveal the raw ECDH shared secret and give a proof of correct decryption using a DLEQ proof. We'll use the same raw ECDH shared secret in further sessions with the other participant, but the other participant was malicious anyway, so that's not a problem in theory. But it's still wrong in practice: Either the other participant is just malicious, but then you don't want to run again with them anyway. But if you really want to rerun with the same set of hostpubkeys (e.g. because some legitimate bug got fixed), then you don't want the ECDH secret to be known.

We could switch to ephemeral ECDH keys, but this needs another round to exchange the keys.

I don't see how some variant of committing symmetric encryption (e.g., add a hash MAC) would work, at least not without adding a round. The accusing participant could reveal the symmetric key to the coordinator, but if the accusing participant is malicious and reveals a garbage key, the coordinator doesn't know who's wrong and then the accused participant needs to exculpate herself by revealing the real key.

If this is correct, then we're left only with annoying options:

Switch to ephemeral ECDH (+ 1 round)
Add exculpation (+ 1 round, at least when the protocol fails)
Add verifiable encryption (no obvious scheme available, see Lloyd's comment above)

real-or-random commented 3 days ago

If this is correct, then we're left only with annoying options:

That's not true. 5d93a8bbcff8d8092bdc1f677befc1738c2edfa2 switches to ephemeral-static ECDH, reusing the same ephemeral ECDH pubkey (let's call it pubnonce) for all recipients. Or more abstractly, this is a form of multi-recipient encryption (built from a multi-recipient KEM build from ECDH). That's a good middle-ground:

The ephemeral pubnonces need to be added to recovery data (but we wanted to add some kind of nonces there anyway due to #30, and the ephemeral pubnonces can plat that role)
No additional rounds
IA is possible without exculpation: The victim verifiably decrypt to the coordinator (by verifiably decapsulating the symmetric key via a DLEQ proof). This will still need committing encryption.

I've just noticed that the construction of multi-recipient KEM from ECDH in the paper linked above hashes slightly different stuff, for good reasons. We should switch just to that construction.

real-or-random commented 2 days ago

This will still need committing encryption.

Not true. The DLEQ proof will show that the ECDH shared secret is correct. From this, the symmetric key can be derived deterministically and decryption is also deterministic. Thus, there's no way for a malicious "victim" to provide wrong data that will lead to a wrong symmetric key.

real-or-random commented 2 days ago

I've just noticed that the construction of multi-recipient KEM from ECDH in the paper linked above hashes slightly different stuff, for good reasons. We should switch just to that construction.

Done in 4fb636071278a9d2288d8a7783bebcfe42ccd16c.

real-or-random commented 2 days ago

Tracked in #32

real-or-random commented 2 days ago

IA is possible without exculpation: The victim verifiably decrypt to the coordinator (by verifiably decapsulating the symmetric key via a DLEQ proof). This will still need committing encryption.

As @LLFourn points out in #32, this will make the victim provide the coordinator a static DH oracle, and this in turn breaks the protocol... Sigh. Thinking about this more, I suspect that all variants that involve revealing some key (or even the received share) are somewhat dangerous because they rely on fresh randomness. Even if we switch to fully ephemeral ECDH (which should be totally fine in theory), revealing the ECDH ephemeral secret key is only safe if the participant doesn't reuse randomness.

So we're back to only these ideas:

The coordinator doesn't aggregate VSS commitments and encrypted shares. This means that the first coordinator-to-participant message from will have size O(tn), and the second coordinator-to-participant message will either be the same for every participant will have size O(n²), or will be different for every participant with size O(n) (aggregating all shares except those for the specific receiving participant). In the recovery data, everything can still be aggregated as before. Assigning blame for omissions needs to be done by the coordinator, assigning blame for wrong shares needs to be done by the victim participant.
Use some form of verifiable encryption.

None of these are great. O(tn) is not the end of the world, but I'm not convinced that it's worth the hassle. I assume it can be annoying, e.g., if some implementation wants to use QR codes.