SmithSamuelM (issue closed 3 years ago)
Added comments from the Slack discussion. These should be put into one of the spec docs as normative text.
@chunningham
IMO it should be included, otherwise there is no commitment to derivation method and it would be possible to have multiple identifiers for the same icp event (different codes with the same output length)
The point is NOT to protect against different identifiers in different inception events where the inception events only differ in the identifier. By mere inclusion of different identifiers they are entirely different inception events. The point is to ensure that a self-addressing identifier may be derivable from one and only one inception event. We want to prevent the same identifier from being derived from two different inception events that have the same identifier but differ in some other way.
It is entirely a valid use case that multiple identifiers use the same set of commitments (witnesses, keys, etc, expressed in an inception event).
The case we want to prevent is the case where the identifiers are identical and self-addressing, but the inception events are not identical. This is an insidious form of duplicity that breaks the guarantee for self-addressing identifiers that there may be only one verifiable inception event for a given identifier. In other words, we want the collision space of inception events for that identifier to be empty.
Clearly non-self-addressing identifiers do not have this guarantee. Any number of inception events for a given non-self-addressing identifier may use that same identifier and still be verifiable. The only protection from such duplicity for non-self-addressing identifiers is that the first seen inception event wins. The collision space of inception events for non-self-addressing identifiers is not empty and KERI does not provide any guarantee of empty collision space for inception events for non-self-addressing identifiers (self-signing is part of the self-addressing class).
Clearly including the derivation code in the dummy in the derivation of a self-addressing identifier does not prevent a non-self-addressing identifier from using that very same inception event that only differs in the prefix. But only differing in the prefix is all we need. Two inception events that differ in the prefix only are indeed two different inception events for two different identifiers. To clarify, if inception events are different because the prefixes are different then they are not duplicitous in any way.
So given the above, what we want to do with the dummy is avoid inadvertent use of the dummy event or confusion of the dummy identifier with a real one. The dummy must not be a valid identifier or be confusable with a valid identifier. It is merely a placeholder to get the length right for the version string and may be reused without harm for the derivation of the self-addressing identifier. The final event has the derivation code in the real identifier, and that is what matters for duplicity. For interoperability of verification of the derivation we need to either use the same dummy or exclude the prefix element from the derivation. Since excluding is more work, we might as well use the dummy and pick the same dummy everywhere.
I suggest that doing anything to the dummy to give it the appearance of being a valid identifier, such as including a derivation code, confuses the purpose of the dummy and could result in bugs or errors when it is not recognized as the dummy. Indeed, using '#' as the character is intentional: it prevents ever confusing the dummy with a real prefix.
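To illustrate that last point, here is a minimal Python sketch (not the keripy API; the 44-character size for a Blake3-256 self-addressing prefix and the helper name are illustrative) showing why '#' is a safe dummy character: it lies outside the URL-safe Base64 alphabet used for qualified prefixes, so a dummy-filled prefix can never verify as a real one.

```python
import string

# URL-safe Base64 alphabet used for qualified Base64 (qb64) material.
B64_CHARS = set(string.ascii_letters + string.digits + "-_")

DUMMY = "#"  # intentionally NOT a Base64 character, so it can never look like a real prefix


def dummy_prefix(qb64_size: int) -> str:
    """Placeholder prefix of the final qb64 length, used only for sizing."""
    return DUMMY * qb64_size


assert DUMMY not in B64_CHARS       # the dummy can never verify as a real prefix
print(dummy_prefix(44))             # e.g. 44 characters for a 32-byte digest plus a 1-char code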
A guarantee of an empty collision space of inception events for a self-addressing KERI identifier is a strong guarantee. This is stronger than the initial conception of self-certifying self-addressing identifiers, which was that the collision space of supporting infrastructure be empty. Supporting infrastructure would mean keys, witnesses, and configuration traits. The idea being that if any of the supporting infrastructure changes then the identifier should change. By strengthening this guarantee to the totality of the inception event except for the identifier prefix element, we get the infrastructure guarantee as a subset. However, this stronger guarantee now means that if we want to have unique self-addressing identifiers that share the same infrastructure we would have to add a nonce element to their inception events. This is not the usual case because in general one should use unique keys for unique identifiers as a security measure. But in some cases that may not be necessary. The spec allows one to add elements not reserved by the spec, so adding a nonce of any kind is not precluded.
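As a rough illustration of the nonce idea (the field names here, including "nonce", are hypothetical and not reserved by the spec), two inception events that share identical keys and witnesses still become distinct events, and therefore yield distinct self-addressing prefixes, when each carries its own nonce element:

```python
import secrets

# Same supporting infrastructure (keys, witnesses) shared by two inception events.
shared = {
    "k": ["DKey0..."],   # signing keys (truncated, illustrative)
    "w": ["BWit0..."],   # witnesses (truncated, illustrative)
}

# Distinct non-reserved nonce elements make the events, and hence their
# self-addressing prefixes, distinct even though the infrastructure is identical.
icp_a = dict(shared, nonce=secrets.token_urlsafe(16))
icp_b = dict(shared, nonce=secrets.token_urlsafe(16))

assert icp_a["nonce"] != icp_b["nonce"]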
The current extracted-data algorithm for self-addressing and self-signing identifier prefixes does a depth-first serialization of all values of elements (fields) in the inception event except for the identifier prefix element. This is done in the Prefixer class. This makes the identifier prefix a function of those values. Previously the nature of the values was such that they were unique relative to the elements in the inception event. The original extraction algorithm was designed early in the development process and a lot has changed since. It extracted only values because, at the time, element values were either cryptographic material, which is unique, or other fields which were unique.
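A rough sketch of what such a depth-first value extraction looks like (not the actual Prefixer code; using "i" as the prefix element label is an assumption for illustration): it walks the event mapping in order and concatenates leaf values, skipping the prefix element.

```python
def extract_values(event: dict, skip=("i",)) -> str:
    """Depth-first concatenation of element values, excluding the prefix element."""
    out = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, (list, tuple)):
            for value in node:
                walk(value)
        else:
            out.append(str(node))

    for label, value in event.items():
        if label in skip:
            continue            # the identifier prefix element is not included
        walk(value)
    return "".join(out)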
However, the fractionally weighted signing threshold has values which are not unique. As a list of lists of fraction strings, the grouping of the lists is arbitrary for a given set of weights, but different groupings would have the same serialization. This may expose the inception event to a transaction malleability attack where two semantically different weighted thresholds have the same element serialization, thus resulting in the same identifier prefix for two semantically different inception events.
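To make the malleability concrete, here are two hypothetical fractionally weighted thresholds with different clause groupings whose naive value-only extraction (as sketched above) collapses to the same string:

```python
# Two semantically different fractionally weighted thresholds...
sith_a = [["1/2", "1/2"], ["1/3"]]
sith_b = [["1/2"], ["1/2", "1/3"]]


def flatten(clauses):
    # ...collapse to the same string when only leaf values are concatenated,
    # so both inception events would derive the same self-addressing prefix.
    return "".join(w for clause in clauses for w in clause)


assert flatten(sith_a) == flatten(sith_b)   # both yield "1/21/21/3"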
Three proposed solutions were discussed:
1) Add a check with custom serialization for the sith (signing threshold) element that inserts demarcation characters between clauses and weights to ensure serialization uniqueness. This would not break any test vectors except those recently written for fraction serialization. However, it may leave exposure to similar non-uniqueness malleability vulnerabilities in the future for as-yet-undefined elements.
2) Change the extracted-data serialization to add delimiters between each element value and each nesting of element values. This would ensure syntactic serialization uniqueness for semantically unique extracted values. This will break existing test vectors but is a general solution.
3) Use the existing serialization created in the derivation code in Nextor as the new extracted serialization. In order to compute the correct size for the version string, a dummy value of the correct size is inserted into the dict for the inception event, which is then serialized using the specified serialization (JSON, CBOR, or MGPK). This serialization is then thrown away, but it is functionally sound and syntactically unique for semantically unique events, making it equivalent to 2) if not stronger. Since it is computed anyway, we might as well use it instead of the existing extracted-data algorithm. This will also break existing test vectors. As a result the extracted-data serialization type is the same as the serialization already chosen for the event and is not fixed. The main difference between 3) and 2) above is that this serialization includes the identifier prefix element, but with a dummy prefix, whereas 2) excludes the element entirely. This means that for interoperability all implementations must use the same dummy character in the dummy prefix. The '#' hash mark character was chosen. It should be a character that is not a valid Base64 character, thereby making the dummy prefix invalid and avoiding a potential bug where the dummy serialization gets used as the real event serialization.
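A minimal sketch of option 3), under the assumption of JSON serialization, a 44-character Blake3-256 self-addressing prefix with derivation code 'E', and "i" as the prefix element label (all of these, the helper name, and the simplified Base64 packing are illustrative, not the keripy API; `hashlib.blake2b` stands in for Blake3-256):

```python
import base64
import hashlib
import json

DUMMY = "#"
QB64_SIZE = 44   # qb64 length of a 32-byte digest plus a 1-character derivation code


def derive_self_addressing(icp: dict) -> str:
    # 1. Fill the prefix element with a dummy of the final qb64 length so the
    #    version string is computed over a correctly sized event.
    dummied = dict(icp, i=DUMMY * QB64_SIZE)
    # 2. Serialize the whole dummied event with the event's own serialization (JSON here).
    raw = json.dumps(dummied, separators=(",", ":")).encode()
    # 3. Digest the dummied serialization; the serialization itself is then thrown away.
    dig = hashlib.blake2b(raw, digest_size=32).digest()   # stand-in for Blake3-256
    # 4. Prepend the derivation code and Base64-encode the digest to form the real
    #    prefix (the packing here is simplified relative to KERI's qb64 rules).
    return "E" + base64.urlsafe_b64encode(dig).decode().rstrip("=")


icp = {"v": "KERI10JSON000000_", "i": "", "s": "0", "kt": [["1/2"], ["1/2", "1/3"]]}
print(derive_self_addressing(icp))
```

Because the whole dummied event body is digested, any change to any element, including the grouping of the fractionally weighted threshold, changes the derived prefix, which closes the malleability gap described above.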
After discussion with the group it was decided that 3) was the preferred option. The Rust developers were not in the meeting, so feedback from them is desired before the final commit. This pull request implements 3) so one can see the implications of the change.