WebOfTrust / keripy

Key Event Receipt Infrastructure - the spec and implementation of the KERI protocol
https://keripy.readthedocs.io/en/latest/
Apache License 2.0
60 stars 55 forks

Multi-sig exchange transaction coordination #884

Open iFergal opened 2 weeks ago

iFergal commented 2 weeks ago

Feature request description/rationale

Edit: Renamed issue based on discussions. Sequence numbers alone can't solve this even for IPEX.


@SmithSamuelM suggested this last week when I described the scenario where multi-sig members can deadlock their view of an exn transaction set or IPEX set.

For example, in a 1 of 2 multi-sig, if both members IPEX admit a credential issued at the same time, but with a slightly different dt, each member will have a different view of the IPEX transaction set because the SAIDs of each admit are not the same. This can happen with different setups too (e.g. 2 of 3).
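The divergence can be sketched with plain hashing as a stand-in for real SAID computation (KERI actually uses Blake3-256 with CESR encoding and a saidify pass; the field values below are made up):

```python
import hashlib
import json

def said_like_digest(exn: dict) -> str:
    # Illustrative stand-in for a SAID: a digest over the serialized message.
    # The divergence mechanism is the same as with real SAIDs: any field
    # change, including dt, changes the digest.
    raw = json.dumps(exn, sort_keys=True).encode()
    return hashlib.sha256(raw).hexdigest()

base = {"t": "exn", "r": "/ipex/admit", "i": "EMultisigAID", "p": "EGrantSAID"}
admit_a = {**base, "dt": "2024-11-05T14:33:44.468000+00:00"}
admit_b = {**base, "dt": "2024-11-05T14:33:44.512000+00:00"}  # 44 ms later

# The admits differ only in dt, so each member computes a different digest
# and therefore holds a different view of the transaction set.
assert said_like_digest(admit_a) != said_like_digest(admit_b)
```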

So either we need to synchronise the members at a higher level, or use sequence numbers over timestamps. I like the idea of a sequence number because it's simpler and avoids making interoperability harder.

However, right now keripy has dt as a required top-level field for exn messages.

SmithSamuelM commented 2 weeks ago

We should put this on the agenda to discuss. There are trade-offs either way.

The SAID of the previous exn, in the p field uniquely orders a set of EXNs. The SAID is necessary for that ordering to be cryptographically strong. Another ordering field that is monotonic is helpful in managing events in escrow or pre-ordering when the prior event has not been received yet.
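A hedged sketch of how p-chain ordering works, with toy dicts standing in for parsed exn messages (this assumes the first message in a set carries an empty p, and exactly one linear chain):

```python
def order_by_prior(exns):
    """Reorder a set of exn messages by following the p (prior SAID) chain."""
    by_prior = {m["p"]: m for m in exns}  # prior SAID -> message
    chain = []
    prior = ""  # assumed: the first message in the set has an empty p field
    while prior in by_prior:
        m = by_prior[prior]
        chain.append(m)
        prior = m["d"]  # the next message's p must equal this message's SAID
    return chain

# Messages received out of order are restored to their cryptographic order.
received = [
    {"d": "Eccc", "p": "Ebbb"},
    {"d": "Eaaa", "p": ""},
    {"d": "Ebbb", "p": "Eaaa"},
]
assert [m["d"] for m in order_by_prior(received)] == ["Eaaa", "Ebbb", "Eccc"]
```

A monotonic sequence number would let an escrow hold and sort messages even before the prior event arrives, which the p chain alone cannot do.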

The challenge, as pointed out, is that when multiple parties must sign a given event, then there must be some synchronization method to ensure they all sign the same event, as in the event with the same SAID.

One solution is to have the ordering field be predictable so that each signer may asynchronously generate the same event. A SN is predictable, whereas a timestamp is not.

However, in many transactions, the next event type may not be predictable, or there may be other fields that have options and all the signers may not be in agreement as to what those values should be without some side channel synchronization method.

So my suggestion at IIW to use SN instead of timestamp may have been a little hasty.

For example, one could just leave the dt field empty and depend solely on the p field value to order the events. In that case, one could add a SN in the body section.

I am trying to minimize the changes. In many exchanges, a datetime is essential to the transaction, but allowing the field to be empty would be preferable to omitting the field, because once a field becomes optional one can no longer use a compact fixed-field (no labels) serialization. And we want to preserve that capability, at least at the top level.

So I suggest adding a modifier: when dt is the empty string, it is ignored, and the transaction body may include some other type of field for ordering, such as a sn.

To cover all possible conditions will, I suspect, require that the members of a multi-sig group engage in a side channel discussion to decide, before they sign, what the next event should be.

The side channel discussion could be yet another exn exchange, but single sig, between each member of the group.

SmithSamuelM commented 2 weeks ago

Some additional thought. EXNs are meant to be generic wrappers for transactions. Each transaction type is expected to define the payload of its EXNs. In that case, the dt field in the wrapper is non-material. Likewise, different wrapper SAIDs shouldn't matter in a multi-sig case, because the multi-sig commitment that must be verified as satisfying the multi-sig threshold is not to the wrapper, which is unique per member, but to the payload, which can be made to be the same for every multi-sig member. The SAID of the payload then becomes what is verified. Since a signature on the wrapper is a signature on anything inside the wrapper, a verifier just has to verify that the SAID of the payload is the same for all signatories.
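The validation rule being described might be sketched like this (assuming, per the suggestion, that the payload in the a section carries its own SAID in a d field; signature verification itself is elided, and all values are made up):

```python
def multisig_commitment_ok(wrapper_exns, threshold):
    # Each member's wrapper exn is unique (its own dt, its own wrapper SAID),
    # but every wrapper must embed a payload with the same SAID. The verifier
    # checks the payload SAID, not the wrapper SAID.
    payload_saids = {exn["a"]["d"] for exn in wrapper_exns}
    return len(payload_saids) == 1 and len(wrapper_exns) >= threshold

wrappers = [
    {"d": "Ewrap1", "dt": "2024-11-05T14:33:44+00:00", "a": {"d": "Epayload"}},
    {"d": "Ewrap2", "dt": "2024-11-05T14:33:45+00:00", "a": {"d": "Epayload"}},
]
assert multisig_commitment_ok(wrappers, threshold=2)          # same payload SAID
assert not multisig_commitment_ok(wrappers[:1], threshold=2)  # under threshold
```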

Alternatively, the payload could itself be a wrapper with a nested payload and an embed of a signature on the nested payload.

iFergal commented 2 weeks ago

For multi-sig IPEX, /multisig/exn messages are used as wrappers of /ipex/admit (for example) to communicate. In the outer wrapper, i is the member identifier. These wrappers can differ and are unique per member; no issues there.

The problem relates just to the embedded /ipex/admit exn message, where i is the multi-sig identifier. Since other top-level fields like the route and p are necessary to the transaction, are you suggesting that some of this information gets moved to the payload/a field? I'm not sure it'd work, especially since for offer or grant the e field can contain ACDC embeds.

{
  "v": "KERI10JSON000178_",
  "t": "exn",
  "d": "EAD9sCozkmrtCP3JZ_0k4K3VKWCs0gGu3lmrExlER61I",
  "i": "EMKGivK9H_eGeTeyBlgL5KiK7jCW4jdSY_v-M-999CT1",
  "rp": "EExhKobSlF3vSBDgQGJYBFjRRs7A1LsUI9qlfb5K9TVt",
  "p": "EFvuOuuDbEbyHvGEExuu_3mHHxZZHXXhtbuibaLp_Itj",
  "dt": "2024-11-05T14:33:44.468000+00:00",
  "r": "/ipex/admit",
  "q": {},
  "a": {
    "i": "EExhKobSlF3vSBDgQGJYBFjRRs7A1LsUI9qlfb5K9TVt",
    "m": ""
  },
  "e": {}
}
SmithSamuelM commented 2 weeks ago

@iFergal What I am suggesting is that there be a SAID in the a section which is the same across all member commitments to a given admit. The signature protects against replay attacks by non-members, but the validation of the multi-sig is against the SAID in the payload.

There are other similar ways to make a multi-sig commitment to a transaction.

For example, the "settlement" or "finalization" of the transaction is not made per se with a signature on an exn but by anchoring a payload in a TEL/KEL. So a verifier is not verifying a multi-sig on the exn itself but that a threshold majority of the members have anchored the payload. This demotes the exn to a secure wrapper.
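A sketch of that verification rule, modeling each member's KEL anchors as a set of anchored SAIDs (names and shapes are illustrative, not keripy API):

```python
def finalized(payload_said, member_anchor_sets, threshold):
    # Finalization check: a threshold of members have anchored the payload
    # SAID in their KELs. The exn wrappers themselves are not verified here;
    # they are demoted to secure transport.
    count = sum(1 for anchors in member_anchor_sets if payload_said in anchors)
    return count >= threshold

anchors = [{"Epayload"}, {"Epayload", "Eother"}, set()]
assert finalized("Epayload", anchors, threshold=2)      # 2 of 3 anchored
assert not finalized("Epayload", anchors, threshold=3)  # third never anchored
```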

These complexities are transaction type specific, so they don't belong at the EXN level; they are at the payload level. Attempting to build an EXN that satisfies every constraint of every type of transaction is an exercise in futility.

The top level EXN routes are typical to most interactive transactions, but the payloads are not. So the solution is to think more about how to commit to a payload with multi-sig and less about changing exns to be multi-sig. The problem with the latter is that single sig transactions get really complicated unnecessarily.

SmithSamuelM commented 2 weeks ago

Think of an exn as a peer-to-peer secure wrapper for conveying payloads that are part of a transaction. Finalization of the transaction with multi-sig can happen independently of the peer-to-peer wrappers.

SmithSamuelM commented 2 weeks ago

In general, the difficulty of multi-sig is that there needs to be a coordination mechanism for the members of the multi-sig group to agree. This coordination can happen with a synchronized communication mechanism or with an asynchronous communication mechanism. The latter requires that members of the group can "predict" the serialized data that they must all independently commit to. With KELs and key events this was done. But EXNs at the top level are not designed for that. So if the coordination is to be asynchronous, then the payload of the EXN must be predictable. Otherwise, there needs to be a side channel coordination protocol that synchronizes the exns.

iFergal commented 2 days ago

We discussed this a couple of weeks ago on the dev call. It seems impossible to have predictability of IPEX in all cases, especially when spurn is considered.

Long term, the idea of anchoring in the KEL seems cool, but I'd maybe revisit that later to avoid too many changes in keripy/KERIA right now.

For now, I'd like to explore the side channel communication between participants. I'm taking a look at enhancing the existing /multisig/exn wrappers to have members propose a next step in the transaction and agree/reject. Once enough agree on a proposal, they can sign and submit the IPEX message.

In case of multiple concurrent proposals within a certain time period, a simple resolution might be the member's index in the smids array, which is the same for each member of the multi-sig.
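That tie-break could be as simple as the following sketch (the proposal shape and field names are made up for illustration):

```python
def resolve_concurrent(proposals, smids):
    # Every member holds the same ordered smids list, so picking the
    # proposal whose proposer appears earliest in smids is deterministic
    # and needs no further communication between members.
    return min(proposals, key=lambda p: smids.index(p["proposer"]))

smids = ["EmemberA", "EmemberB", "EmemberC"]
concurrent = [
    {"proposer": "EmemberC", "said": "Eadmit1"},
    {"proposer": "EmemberB", "said": "Eadmit2"},
]
# EmemberB precedes EmemberC in smids, so every member picks Eadmit2.
assert resolve_concurrent(concurrent, smids)["said"] == "Eadmit2"
```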

The challenging part is handling nodes being offline, or going offline after proposing. I need to think more on this; open to suggestions. Distributed consensus algorithms assume a number of online nodes for fault tolerance, which we might not have in this kind of communication.

I think I'll need some rollback functionality to cover all cases.

2byrds commented 2 days ago

At the vlei-dev-community meeting we continued to discuss the difficulty of syncing the different scenarios/states related to offline multisig participants. @daidoji suggested that we could use a round-robin mechanism like KAWA, which is provably convergent. To reach consensus BEFORE anchoring using KAWA, there is a proposal and a list of next signatures. Participants race for a threshold of signatures from the online participants (or eventual offline participants returning). Whoever completes KAWA and gets it witnessed wins. @iFergal noted that IPEX is completely off the KEL. If we introduced this KAWA technique, you would need several internal KERIpy changes to poll witnesses, etc. in order to 'settle' the result.

pfeairheller commented 2 days ago

From the original issue:

For example, in a 1 of 2 multi-sig, if both members IPEX admit a credential issued at the same time, but with a slightly different dt, each member will have a different view of the IPEX transaction set because the SAIDs of each admit are not the same. This can happen with different setups too (e.g. 2 of 3).

First of all, if each one admits a credential, the net result is that both now have a copy of the credential, which is all that should matter unless you need the IPEX transactions to have meaning beyond conveying the credential. If you need the "transaction set" to be the same between parties and have "meaning", then you are overloading IPEX. In that case you would need to anchor the IPEX messages so their signatures have long-term meaning. Unanchored exn messages are only intended to have transient meaning. If you want more than that, you need to anchor them, which prevents this problem.

Additionally I don't see how this can happen with a 2 of 3 setup, or any other setup that follows proper threshold rules. How can a 2 of 3 accept different admit messages for the same credential unless one of the members admits the same credential twice? That feels a lot like a PEBKAC problem and not anything to do with the protocol...

iFergal commented 2 days ago

First of all, if each one admit a credential the net result is that both now have a copy of the credential which is all that should matter unless you need the IPEX transactions to have meaning beyond conveying the credential. If you need the "transaction set" to be the same between parties and have "meaning" then you are overloading IPEX.

Thanks. Fair enough, and I understand re anchoring them. But even for short-term meaning, what about an IPEX offer? If the other party responds with an agree, one member would consider it invalid.

Additionally I don't see how this can happen with a 2 of 3 setup, or any other setup that follows proper threshold rules. How can a 2 of 3 accept different admit messages for the same credential unless one of the members admits the same credential twice?

You're right, 2 of 3 should be OK, my bad. And I'm not saying there's anything wrong with the protocol, I'm just trying to come to a solution in general, IPEX is just the example I'm dealing with at the moment.

In something like 2 of 4, it can also happen. I understand the need to have thresholds set up properly, as with witness pools, to avoid these scenarios. However, I think it's quite restrictive to hold group members to these same threshold rules, because business needs might not match that. But maybe I should reconsider.

pfeairheller commented 2 days ago

When @SmithSamuelM first introduced transaction event logs (TELs) it was a very generic mechanism to secure transactions of any kind. In our community it has become synonymous with the VC credential registry but that is only one (very simple) use case for it. In this case, you could create a series of TEL events for IPEX that are anchored into a KEL to get you exactly what you want and also use the one (and only) synchronization mechanism in KERI, the KEL. Then you get things like KAWA with witness thresholds for free instead of trying to create a bespoke version for just IPEX.

SmithSamuelM commented 2 days ago

A very common pattern in business is that a designate of the business conducts all the negotiations to produce an agreement, contract, or document (issuance), which is then finalized through endorsements by a hierarchy of decision makers at the business.

This pattern could be implemented with single sig negotiations that finalize by anchoring the issuance in the KEL or KELs of the hierarchy of decision makers.

As @pfeairheller points out, IPEX was meant as an ephemeral exchange to negotiate the terms of an issuance of an ACDC, where the negotiation supports graduated disclosure. That negotiation is finalized by issuance of the ACDC. Should that issuance have contractual terms, as in contractually protected disclosure, then those terms are embedded in the Rules section of the ACDC. The finalization could be accomplished by both parties anchoring (i.e. endorsing) in their respective KELs.

The anchoring does not have to be one anchor at the end. A given EXN could be anchored during the transaction as part of the transaction logic.

Multisig complicates things, and thresholds that match business logic require coordination. That coordination can be out of band or in band. But when it's in band, one needs a bespoke protocol to protect the in-band coordination from deadlocks etc. So either one designs the thresholds appropriately to prevent deadlock, or one uses a different mechanism and threshold for in-band negotiation. In general, business logic thresholds should be designed with out-of-band, business-logic-friendly coordination, and if the out-of-band coordination messes up, you get a deadlock. So you just start over. Which happens every day in real businesses.

Attempting to solve all of these problems everywhere at once would create a monstrosity of a protocol. Might as well use a blockchain at that point.

iFergal commented 2 days ago

@pfeairheller Yeah, I do like the idea of that. I was only avoiding it for now because of too many possible keripy/KERIA changes, given my upcoming deadlines.

But yeah, the happy path (for what I suggested) is pretty quick to implement out of band but the unhappy path is a big pain. So I'm starting to think that KEL anchoring is less work.

So in general business logic thresholds should be designed with out-of-band business logic friendly coordination, and if the out-of-band messes up you get a deadlock. So you just start over. Which happens every day in real businesses.

Yeah, if I understand correctly this out of band coordination is what I was trying to do. And have a rollback for when it messes up.

But by the time you are "starting over" to avoid the deadlock, the other party or issuer could have multiple sufficiently signed IPEX messages, so even that's more complicated (unless, as suggested, we rely on KEL anchoring).

SmithSamuelM commented 2 days ago

As you all know, KAWA is not required. A given controller can pick any threshold they want for their witness pool. It's up to validators to decide if they want to engage with controllers who have problematic thresholds for their witness pools. But if someone does use KAWA, then there is an assurance of either one agreement or no agreement.

Similarly, KAWA rules could be applied to multisig thresholds. Or EGFs could put restrictions on the types of thresholds. I frankly think that may not be the best way to manage it. But ultimately a controller is responsible for the thresholds they choose to use.

Let's take a real world example. A business may have a purchasing department with multiple purchasing agents. Each agent is authorized to issue purchase orders. This means you have deadlock or conflicting issuance potential built into the structure of the business. Two different purchasing agents could issue two different purchase orders for the same products from the same vendor. This could arise simply from miscommunication among the purchasing agents. The business only needs one set of the products, but say the purchasing director notices a deadline and asks some of the agents whether the PO has been issued; they have not issued it. An agent who is not at work issued it before leaving the day before and didn't tell anyone. So another agent issues a duplicate PO, only slightly different, with a different datetime and PO number.

How does that get resolved?

Is it the job of KERI to fix these sorts of business coordination problems? That would be problematic.

There are lots of ways the business could decide to coordinate on its end to ensure one and only one PO is ever designated to be issued. This may have thorny corner cases.

So the use case that started this issue is basically the same. Two members of a group multisig do not coordinate out-of-band, so they misissue or redundantly issue an exn.

We can't solve those problems in general unless we go the route of having one and only one way to do everything, which looks like a shared distributed consensus ledger.

It's OK to impose on that business a requirement that they coordinate out of band on their multisig.

As a service for specific workflows, one could decide to restrict the types of multi-sigs that are allowed or supported, so that the service can facilitate more automation in that workflow.

But the hard problem is to decide where to draw the line and not succumb to the temptation to move the line.

SmithSamuelM commented 2 days ago

In the context of vLEIs, which are fully public and pre-determined, there is no graduated disclosure needed, so IPEX gets really simple. And the EGF for vLEIs bounds and limits what works.

So I assume that these issues are arising because people want to issue some ACDC besides a vLEI and want to support graduated disclosure.

Well that means designing up front at least the outline of the EGF and then using that to limit what sort of workflow needs to be supported.

Do the hard work of the use case analysis, then the EGF, then code. Not code, then use case, then EGF.

SmithSamuelM commented 2 days ago

The idea of smart contracts has created a false sense of propriety for this type of code, which is automating business processes, aka a smart contract.

The big myth of smart contracts is the phrase "code is law". In any practical application there are always some manual processes that are essential to cover the full spectrum of fault conditions.

In the automation world we call these safety jackets. We want to automate safety jackets when feasible, but for many use cases or fault conditions there are no viable automated safety jackets that are actually safe.

So smart contracts, by the definition of "code is law", preclude manual safety jackets. This means, therefore, that in any real world complex practical application, either your smart contract is too fragile or too dumb, or both.

One purpose of an EGF is to draw the line between manual processes (like out of band coordination) and automated processes for the use cases governed by the EGF.

Drawing these lines is what makes the system robust because you define all the safety jackets both manual and automated in the process of generating the EGF.

You don't have to formally create an EGF but you need to design your business process workflow with both automated components and non-automated components and decide the policy for both and make the trade-offs e.g. draw the lines.

This should all happen before there is a discussion of well this threshold could result in a deadlock so we need to protect against such a deadlock.

iFergal commented 2 days ago

@SmithSamuelM For context, I am not developing around a specific EGF because my team is building a mobile wallet that can be taken by others for their use cases and EGFs. It has been for sure challenging, because it's so broad and has all of these potential edge case issues.

I also don't want to complicate the KERI libraries just because I'm targeting unknown edge cases. There is the alternative approach where I simply impose some restrictions until a need arises from a customer to remove them. Though I don't like the limitation of group thresholds needing to be set up to avoid deadlocks since, as you pointed out, that may easily not match business needs.

iFergal commented 1 day ago

@pfeairheller Just to clarify, since I remembered now!

In 2 of 3 you obviously shouldn't have a situation where there are 2 fully signed paths, but you can still deadlock because keripy stores the IPEX forward reference (hby.db.erpy) without it needing to be fully signed.

Some added debug logs to test_multisig in KERIA (which is 2 of 2 multi-sig):

member1: submitting ipex/admit (said=EEmZ-zQnIo4iWValihlGeqUWx7rF7lEdPBnW0wh3mkK9)
Logging event {'v': 'KERI10JSON000171_', 't': 'exn', 'd': 'EEmZ-zQnIo4iWValihlGeqUWx7rF7lEdPBnW0wh3mkK9', 'i': 'EEJCrHnZmQwEJe8W8K1AOtB7XPTN3dBT8pC7tx5AyBmM', 'rp': 'ECJg1cFrp4G2ZHk8_ocsdoS1VuptVpaG9fLktBrwx1Fo', 'p': 'EBH4vhJOxM5SMAOIFyNc1pBsuBYb1CfJ4qEmNZrwHB5R', 'dt': '2024-11-22T11:44:24.924643+00:00', 'r': '/ipex/admit', 'q': {}, 'a': {'i': 'ECJg1cFrp4G2ZHk8_ocsdoS1VuptVpaG9fLktBrwx1Fo'}, 'e': {}}
member1: forward reference from grant: EEmZ-zQnIo4iWValihlGeqUWx7rF7lEdPBnW0wh3mkK9
member2: forward reference from grant: None
member2: submitting /ipex/admit (said=EEmZ-zQnIo4iWValihlGeqUWx7rF7lEdPBnW0wh3mkK9)
Logging event {'v': 'KERI10JSON000171_', 't': 'exn', 'd': 'EEmZ-zQnIo4iWValihlGeqUWx7rF7lEdPBnW0wh3mkK9', 'i': 'EEJCrHnZmQwEJe8W8K1AOtB7XPTN3dBT8pC7tx5AyBmM', 'rp': 'ECJg1cFrp4G2ZHk8_ocsdoS1VuptVpaG9fLktBrwx1Fo', 'p': 'EBH4vhJOxM5SMAOIFyNc1pBsuBYb1CfJ4qEmNZrwHB5R', 'dt': '2024-11-22T11:44:24.924643+00:00', 'r': '/ipex/admit', 'q': {}, 'a': {'i': 'ECJg1cFrp4G2ZHk8_ocsdoS1VuptVpaG9fLktBrwx1Fo'}, 'e': {}}
member1: forward reference from grant: EEmZ-zQnIo4iWValihlGeqUWx7rF7lEdPBnW0wh3mkK9
member2: forward reference from grant: EEmZ-zQnIo4iWValihlGeqUWx7rF7lEdPBnW0wh3mkK9

The above is OK since they signed the same IPEX in the test, but it would be an issue if the timestamp were off, as previously mentioned. So even in the simple case, some extra work could be needed. Unless the forward reference (hby.db.erpy) should only be stored once it's fully signed?

SmithSamuelM commented 1 day ago

> my team is building a mobile wallet that can be taken by others for their use cases and EGFs.

May I suggest that the solution for this, given multisig, is a more human-centric approach to OOB coordination. Historically, distributed consensus algorithms made the assumption that any and all participants were potentially malicious and therefore the algorithm needed to be BFT (Byzantine Fault Tolerant). But this takes one down a path that does not match the business use case for a multi-sig.

It is a safe assumption that if a business creates a multisig group AID, all the participants have strong incentives to cooperate in a largely non-malicious way. So the coordination can have manual fail-over for the malicious or byzantine types of faults and only needs to support, in an automated way, non-malicious faults like unavailability. And this can largely be accomplished with a UX that enforces the coordination.

For example, using the purchasing department example from above: suppose that there is an invoice leaderboard in software that publishes to the group of purchasing agents all work-in-progress purchase orders and who they are assigned to. The purchasing director can reassign any PO in progress to a different agent should a given agent fail (be unavailable) to complete the workflow.

This generalizes to wallets for any issuance where the AID is multi-sig. Always pick a lead participant and then create a non-byzantine fault tolerant (i.e. availability fault tolerant) way to reassign the leader. No one will ever generate two competing versions of an event; that would be malicious. This can be enforced in software such that only the leader assigned to a given workflow is allowed to generate an event. This is not enforced by the protocol but by the UX.

All the non-leader participants can do is confirm or deny by endorsing. Hence no deadlocks. If they do not endorse because they object then the process stops and some OOB coordination such as chat can be used to facilitate overcoming their objection.
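A minimal sketch of that UX-level gate (illustrative only; none of these names are keripy API):

```python
class MultisigWorkflow:
    """UX-enforced leader rule: only the leader proposes, others endorse."""

    def __init__(self, leader, members):
        self.leader = leader
        self.members = set(members)
        self.endorsements = set()

    def may_generate_event(self, member):
        # Not a protocol rule: the software simply refuses to let a
        # non-leader generate a competing version of the next event.
        return member == self.leader

    def endorse(self, member):
        # Non-leaders can only confirm; withholding endorsement halts the
        # workflow and pushes resolution to OOB channels (e.g. chat).
        if member in self.members and member != self.leader:
            self.endorsements.add(member)

wf = MultisigWorkflow("EmemberA", ["EmemberA", "EmemberB", "EmemberC"])
assert wf.may_generate_event("EmemberA")
assert not wf.may_generate_event("EmemberB")
wf.endorse("EmemberB")
assert wf.endorsements == {"EmemberB"}
```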

IMHO these are not good applications of BFT. We don't want to impose BFT because businesses are not Byzantine. The Byzantine generals problem arose because you had potential adversaries (armies) attempting to cooperate. But within each army it is safe to assume that the underlings obey their general, because they have a very strong incentive to obey. Within a business this is true in general. Someone who acts maliciously looks like fraud, or is fraud, or at the very least is insubordinate, all of which are firable offenses.

This prevents deadlocks.

But you can look at it another way. Deadlocks are a good thing, because they protect a given party from themselves when malicious participants are acting on their behalf.

So you need a way to start over after you fire your own malicious participant. This would require a key rotation, so the transaction would need to be restarted.

So, at the risk of introducing complexity, one way to do this in general would be to add a kill message that only needs to be signed by any member of the group multisig, not a threshold. Then any member can kill the transaction and force it to start over.

This overcomes any deadlocks.

Of course one can have an implied kill with a timeout. Any deadlock eventually triggers the timeout and the transaction is killed automatically. This is simpler but introduces the latency of the timeout.
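Both escape hatches (the explicit any-member kill and the implicit timeout) can be sketched together; the transaction shape and field names here are invented for illustration:

```python
def transaction_alive(tx, now, timeout=3600.0):
    """Return whether an in-progress multisig transaction is still live."""
    # Explicit kill: any single member may kill, no threshold required.
    if tx.get("killed_by") is not None:
        return False
    # Implicit kill: a deadlocked transaction dies once the timeout elapses,
    # at the cost of the timeout's latency.
    return (now - tx["started"]) < timeout

tx = {"started": 0.0, "killed_by": None}
assert transaction_alive(tx, now=100.0)       # still in progress
assert not transaction_alive(tx, now=4000.0)  # timed out, start over
assert not transaction_alive({"started": 0.0, "killed_by": "EmemberB"}, now=1.0)
```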

iFergal commented 1 day ago

Thanks @SmithSamuelM - very valuable information!

Designating a particular leader makes sense to me; for the generic IPEX wallet case it can simply default to the person who originally created the first /multisig/icp event to be shared. We can see in the future if that needs to be revised based on other business needs. I think I'll get this implemented first.

Just a question on the reassigning of the leader. In case the leader is the one who is unavailable, how is the leader reassigned? Because if the other participants force a reassignment, it can deadlock if the original leader comes back online and submits before realizing it is no longer the leader.

Or are you suggesting only the leader can re-assign? For me, this is OK for now. I just anticipate the problem of the leader of the group being on vacation. :P

edit - or, of course, in case of this deadlock for leader reassignment, we could go for a kill

SmithSamuelM commented 1 day ago

How the leader is reassigned is a custom business logic setup, since leader enforcement is OOB to KERI and IPEX.

So the UX that manages who is the leader could allow the current leader to pick a new leader, or it could be any of a designated set, such as a manager. Its enforcement could be purely software driven.

Imagine that before any exchange message is generated, it goes through a service. This synchronizes everything.

iFergal commented 1 day ago

Makes sense, thanks!

For now I think the leader re-assigning the current leader is good enough for me since it avoids me having to generically handle the edge case I mentioned above.

imagine that before any exchange message is generated it goes through a service. this synchronizes everything

This actually sounds great. Groups could designate this exn sync service!

2byrds commented 1 day ago

That seems like a nice practical approach! I appreciate you sharing the discussion here for us to learn.