dandanlen commented 2 years ago

Description

We want to move the concern of starting the keygen verification stage to the state chain.

Right now, after Keygen completes, the CFE automatically runs a signing ceremony with the new authority set, signing the key itself (I believe? TBC).

This will require picking apart some of the logic in the vaults pallet: Currently when keygen completes, each node returns a new key and the signature, which are verified against each other, and if they match, are tallied similarly to a witness vote. This single step will need to be broken up into:

Reporting the key
Reporting the signature

Voting for each step will need to be tallied according to the existing method where 100% consensus is required for a new key. We could shorten step 2. by using our knowledge that a valid signature can only be generated by 100% of nodes participating, ie we could use an unsigned transaction and just validate the signature instead of counting votes.

As a starting point, we will need a new entry in the VaultRotationStatus enum, AwaitingVerificationSignature, and a new extrinsic report_verification_signature. report_keygen_outcome and the ReportedKeygenOutcome will need to be modified to process just the keygen portion.

Also, this will require some refactoring on the threshold signer pallet and associated traits, since the verification ceremony uses the full authority set and the new/next key rather than the current authority set and key. So we need to add a method request_custom_signature for example, that allow us to pass in the key and the list of ceremony participants.

I would suggest splitting this into two PRs: One for the work on the threshold signer pallet, then another for the vaults pallet.

Alternatives Considered

Leaving as-is. This requires re-using the keygen ceremony id for the verification ceremony, which has caused some problems (see #1972). We seem to have fixed the issue but shifting this concern to the state chain ought to be more robust and predictable.

Additional context

1972

If we requested a keygen at ceremony id 1 and performed signing ceremonies for ceremony id 2 and 3, then we perform the keygen verification for ceremony id 1 (since we currently reuse the same ceremony id for the keygen verification) then the ceremony id tracking fails since the latest ceremony id we had was 3, and now we have a request for a ceremony id of 1.

Blocked by:

[x] #2016

kylezs commented 2 years ago

which has caused some problems

What were the problems?

dandanlen commented 2 years ago

Sorry, a bit cryptic. It's in the additional context, I've edited to make it clearer.

msgmaxim commented 2 years ago

Are we still doing this @morelazers? From our discussion yesterday, it sounded like you were more in favour of a hacky solution on the CFE instead?

dandanlen commented 2 years ago

more in favour of a hacky solution on the CFE instead

I had a discussion with Tom about this, and was pushing in this direction; I'm in favour of solving this the easy way (I would hesitate to call it hacky) at least in the short/medium term.

The problem we're trying to fix was introduced by a spam protection measure, which turned out to be overly aggressive. We can solve the problem simply by being less strict on spam prevention on the lower bound. We don't have to modify the protocol to fix this. The solution is not perfect, but now that we understand the problem, it seems very unlikely that the keygen ceremony would be overtaken by more than a few signing ceremonies. If we add a generous margin then we should be safe. The margin is 6000 on the upper bound, 1000 on the lower bound ought to be generous enough.

Changing the protocol is a non-trivial change, specifically it introduces another SC<->CFE round-trip with all the witnessing and tallying of votes, edge cases, timeouts and failure modes that go along with. While I agree that it feels cleaner in some sense, the above solution seems 'good enough' not to want to risk this change.

Happy to discuss this - we can keep the option on the table but I would classify it as low-priority for now unless there are any strong objections.

msgmaxim commented 2 years ago

To clarify, I'm not pushing for the solution on SC, but simply stating that it is the cleanest solution, like you said. I do believe the decision to reuse keygen ceremony id for signing was already a hack (the SC is the only legitimate source of valid ids and we are trying to circumvent that). The issue we encountered with spam prevention is just a symptom of a deeper problem. Now it makes us create another hack on top of it, and we might be forced to add still more hacks in the future. Note, for example, that doing this properly would also address the issue described in https://github.com/chainflip-io/chainflip-backend/issues/1980 in the most clean way. Of course if it requires a lot of work on the SC and we have other priorities, then I agree we should go with the more pragmatic solution.

The margin is 6000 on the upper bound, 1000 on the lower bound ought to be generous enough.

If we are going the CFE route, I actually don't like this particular solution. I'm not in favour of leaving it to chance: it is conceivable that during the time the keygen ceremony is run (5-10 min depending on parameters), we have more than 1000 signing cermeonies (which is roughly 2 ceremonies per second, and I like to think we should support more than that). Of course we can play with the parameters, but note that 6000 is how many ceremonies we can have between a keygen is requested by the SC and received by some node's CFE (which ought to be no longer that 10 seconds or so realistically), so we'd need a much larger lower bound if we want the same level of confidence (which might defeat the purpose of spam prevention).

chainflip-io / chainflip-backend

Trigger keygen verification from state chain #2004

Description

Alternatives Considered

Additional context

1972