WebOfTrust / keria

KERI Agent in the cloud
https://keria.readthedocs.io/en/latest/
Apache License 2.0

Members of multisig group can become out of sync without a way to recover #213

Open lenkan opened 8 months ago

lenkan commented 8 months ago

Affected versions

Reproduction script

See a reproduction script here: https://github.com/nordlei/vlei-sandbox/blob/main/src/issues/multisig-issuance-problem.test.ts

❯ src/issues/multisig-issuance-problem.test.ts (6) 49803ms
  ✓ Resolve OOBIs 5089ms
  ✓ All members create multisig group 7305ms
  ✓ All members create registry 4019ms
  ❯ Credential issuance (3) 23679ms
    ✓ Member 1 creates the credential 9502ms
    × Member 2 creates the credential - by misunderstanding 23677ms
    ✓ Member 3 joins credential issuance event 6332ms

Steps to reproduce

  1. Create three signify wallets with one identifier each, one for each member.
  2. All 3 members create a multisig group with a signing threshold of 2 (i.e. 2 out of 3).
  3. All 3 members create a credential registry.
  4. Member 1 creates credential
  5. Member 2 creates credential (by misunderstanding the instructions)
  6. Member 3 correctly joins the credential event of member 1.

After this step, the group states of members 1 and 3 are in sync, but member 2 is out of sync: the latest event on their multisig group is not the same as the one members 1 and 3 have.

How does member 2 get out of this state? They cannot simply join the event created by member 1 without first rolling back the event they accidentally created.
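
To make the divergence concrete, here is a toy model of the six steps above. It is only a sketch: `GroupEvent`, `MemberView`, and their methods are hypothetical names invented for illustration and are not part of KERIA or signify-ts. Signatures are shared through a single in-memory object, which stands in for the message exchange between members.

```typescript
// Toy model only. These types are hypothetical and NOT the KERIA or signify-ts API.
type GroupEvent = { sn: number; digest: string; signers: Set<string> };

class MemberView {
  name: string;
  threshold: number;
  // Events this member has seen reach the signing threshold.
  committed: GroupEvent[] = [];
  // Event this member created or joined that has not yet reached the threshold.
  pending: GroupEvent | null = null;

  constructor(name: string, threshold: number) {
    this.name = name;
    this.threshold = threshold;
  }

  // Create or join an event by adding this member's signature to it.
  sign(event: GroupEvent): void {
    event.signers.add(this.name);
    this.pending = event;
    this.settle();
  }

  // Commit the pending event once enough members have signed it.
  settle(): void {
    if (this.pending && this.pending.signers.size >= this.threshold) {
      this.committed.push(this.pending);
      this.pending = null;
    }
  }

  // Digest of the latest event in this member's local view.
  tip(): string {
    const last = this.pending ?? this.committed[this.committed.length - 1];
    return last ? last.digest : "none";
  }
}

const eventA: GroupEvent = { sn: 1, digest: "A", signers: new Set() };
const eventB: GroupEvent = { sn: 1, digest: "B", signers: new Set() };

const m1 = new MemberView("member1", 2);
const m2 = new MemberView("member2", 2);
const m3 = new MemberView("member3", 2);

m1.sign(eventA); // step 4: member 1 creates the credential event
m2.sign(eventB); // step 5: member 2 creates a competing event by mistake
m3.sign(eventA); // step 6: member 3 joins member 1's event (2 of 3 reached)
m1.settle();     // member 1 learns that its event reached the threshold

console.log(m1.tip(), m2.tip(), m3.tip()); // prints "A B A"
```

In this model, member 2 is left with a pending event at the same sequence number as the event the rest of the group committed, which is the stuck state described above.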

Notes

From previous discussion on discord

Phil Feairheller — Today at 2:10 PM

In the command line of KERIpy there is a multisig rollback command for deleting partially signed events at the tip of your KEL. I don’t think anyone has added that functionality to KERIA.
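
As a rough illustration of what such a rollback could mean for member 2, the sketch below drops partially signed events from the tip of a local log so that the event completed by the other members can be accepted. `LocalEvent` and `rollbackTip` are hypothetical names; this is not the KERIpy command's implementation, and it does not call any KERIA or signify-ts API.

```typescript
// Hypothetical model, not the KERIpy or KERIA implementation.
interface LocalEvent {
  sn: number;
  digest: string;
  signatures: number; // number of member signatures collected so far
}

// Remove partially signed events from the tip of a local log,
// stopping at the first event that met the signing threshold.
function rollbackTip(log: LocalEvent[], threshold: number): LocalEvent[] {
  const kept = [...log];
  while (kept.length > 0 && kept[kept.length - 1].signatures < threshold) {
    kept.pop();
  }
  return kept;
}

const member2Log: LocalEvent[] = [
  { sn: 0, digest: "inception", signatures: 3 },
  { sn: 1, digest: "B", signatures: 1 }, // created by mistake, signed only by member 2
];

const rolledBack = rollbackTip(member2Log, 2);

// After the rollback, member 2 can accept the event that members 1 and 3 completed.
const resynced: LocalEvent[] = [...rolledBack, { sn: 1, digest: "A", signatures: 2 }];

console.log(resynced.map((e) => e.digest).join(",")); // prints "inception,A"
```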

edeykholt commented 3 months ago

Would a general design approach for resolving multisig group ceremonies that get stuck be to introduce a time-to-live (TTL) concept? Under such an approach, a stuck ceremony would be reset (expired) if it is not fully committed within the default TTL for that type of request, or within an explicit TTL given in the request.

2byrds commented 2 weeks ago

See https://github.com/WebOfTrust/signify-ts/pull/286 which addresses the 'catchup' for member 2 (they have the exn, notification).

2byrds commented 1 week ago

@edeykholt A lot of these issues involve one member getting out of sync with the group ops (the group ops themselves succeeded). For the out-of-sync member, I think the rollback I'm implementing could eventually be part of the 'time-to-live' that you mention. Perhaps it should be configurable so that the member can set their expectation for the event, something like 'Let's issue this credential over the next 60 minutes, but if it doesn't finish by then, roll back this operation so we can try again later'.
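
One way the time-to-live idea could be combined with rollback is sketched below. Everything here is an assumption made for illustration: `CeremonyType`, `PendingOperation`, `DEFAULT_TTL_MS`, `isExpired`, and `sweep` are hypothetical names, the default durations are arbitrary, and none of this exists in KERIA today.

```typescript
// Hypothetical types for illustration; KERIA does not expose these.
type CeremonyType = "credential-issuance" | "rotation";

interface PendingOperation {
  type: CeremonyType;
  startedAt: number; // milliseconds since epoch
  ttlMs?: number; // explicit TTL supplied with the request, if any
  signatures: number; // member signatures collected so far
}

// Assumed default TTL per request type (values chosen arbitrarily).
const DEFAULT_TTL_MS: Record<CeremonyType, number> = {
  "credential-issuance": 60 * 60 * 1000, // 60 minutes
  rotation: 24 * 60 * 60 * 1000, // 24 hours
};

// An operation expires only if it is still below the threshold when its TTL elapses.
function isExpired(op: PendingOperation, threshold: number, now: number): boolean {
  if (op.signatures >= threshold) return false; // fully committed, never expires
  const ttl = op.ttlMs ?? DEFAULT_TTL_MS[op.type];
  return now - op.startedAt >= ttl;
}

// Split pending operations into those to keep and those to roll back.
function sweep(ops: PendingOperation[], threshold: number, now: number) {
  return {
    keep: ops.filter((op) => !isExpired(op, threshold, now)),
    rollBack: ops.filter((op) => isExpired(op, threshold, now)),
  };
}

const t0 = 0;
const pendingOps: PendingOperation[] = [
  // Uses the assumed 60 minute default.
  { type: "credential-issuance", startedAt: t0, signatures: 1 },
  // Carries an explicit 2 hour TTL in the request.
  { type: "credential-issuance", startedAt: t0, signatures: 1, ttlMs: 2 * 60 * 60 * 1000 },
];

// Check 61 minutes after both operations started.
const sweepResult = sweep(pendingOps, 2, t0 + 61 * 60 * 1000);

console.log(sweepResult.rollBack.length, sweepResult.keep.length); // prints "1 1"
```

Letting an explicit TTL in the request override the per-type default keeps both variants from the discussion available: a group can rely on defaults, or a member can state a specific window for a given ceremony.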

edeykholt commented 1 week ago

@2byrds Seems like a good step forward.