AIDs, SAIDs, etc. can have synonyms?

dhh1128 commented 2 months ago

Version

all

Environment

No response

Expected behavior

I naively expected the following 2 nontransferrable AIDs to be different, since their final character is different: BGZ_DdmzryjQOtOdQauTm_YxggbVM7EWelk8IBxsnC-c and BGZ_DdmzryjQOtOdQauTm_YxggbVM7EWelk8IBxsnC-d

Actual behavior

If I understand correctly (?), these AIDs reference the same key because when the base64url transformation back to raw bytes is performed, the difference between the final 2 characters drops out as irrelevant padding:

base64.b64decode('GZ/DdmzryjQOtOdQauTm/YxggbVM7EWelk8IBxsnC+c=')
b"\x19\x9f\xc3vl\xeb\xca4\x0e\xb4\xe7Pj\xe4\xe6\xfd\x8c\x81\xb5L\xecE\x9e\x96O\x08\x07\x1b'\x0b\xe7"
>>> base64.b64decode('GZ/DdmzryjQOtOdQauTm/YxggbVM7EWelk8IBxsnC+d=')
b"\x19\x9f\xc3vl\xeb\xca4\x0e\xb4\xe7Pj\xe4\xe6\xfd\x8c\x81\xb5L\xecE\x9e\x96O\x08\x07\x1b'\x0b\xe7"

First, I'm not sure that I'm analyzing this correctly. Maybe I'm not building AIDs or SAIDs correctly.

Secondly, do we care?

One consequence I can see is that naive comparisons of AIDs as strings (which would say A != B) are wrong... Another consequence is that a user could construct an AID or hash that looks different and fools a naive comparison algorithm, but is the same in a sophisticated algorithm. This could produce surprising results, which is perhaps a source of vulnerabilities.

Steps to reproduce

see above

dhh1128 commented 2 months ago

tagging @SmithSamuelM because I'm curious if I'm just analyzing this wrong. If so, egg on my face...

dhh1128 commented 2 months ago

Maybe we have already eliminated this ambiguity by disallowing in the spec any extra bits in the pad bytes at the end of a base64url-encoded value?

daidoji commented 2 months ago

@dhh1128 sorry to jump in here but I was curious about your question and I think I can answer it. All the codes are aligned on 4bits in the text domain and 3bits in the binary domain. In this specific case the AIDs are different if their keys are different as the B code doesn't have any padding.

However, in the B64 RFC they require that the encoding/decoding be byte aligned. This means that they'll (RFC compliant libraries) drop bits that aren't on 8-bit boundaries and do other strange things with the padding to make sure they hit that byte-aligned nature. This is probably what's happening here.

When you parse according to the cesr rules in the table, the two AIDs do show up as different payloads (the keys) as shown below from our implementation.

iex(1)> Kerilixir.Cesr.Primitive.CD_B_Ed25519N.from_b64("BGZ_DdmzryjQOtOdQauTm_YxggbVM7EWelk8IBxsnC-c")
{{:ok,
  %Kerilixir.Cesr.Primitive.CD_B_Ed25519N{
    code: "B",
    payload: <<102, 127, 13, 217, 179, 175, 40, 208, 58, 211, 157, 65, 171, 147,
      155, 246, 49, 130, 6, 213, 51, 177, 22, 122, 89, 60, 32, 28, 108, 156, 47,
      156>>
  }}, ""}
iex(2)> Kerilixir.Cesr.Primitive.CD_B_Ed25519N.from_b64("BGZ_DdmzryjQOtOdQauTm_YxggbVM7EWelk8IBxsnC-d")
{{:ok,
  %Kerilixir.Cesr.Primitive.CD_B_Ed25519N{
    code: "B",
    payload: <<102, 127, 13, 217, 179, 175, 40, 208, 58, 211, 157, 65, 171, 147,
      155, 246, 49, 130, 6, 213, 51, 177, 22, 122, 89, 60, 32, 28, 108, 156, 47,
      157>>
  }}, ""}

Also, here's an example of dropping bits by Elixir's b64 parser and how we get even stranger things from the padding (in the CESR scheme three underscores map to 6bits each so it should be <<255, 255, 3::size(2)>> in elixir this maps to 0b1111_1111 0b 1111_1111 0b11 in the python binary notation).

Base.url_decode64!("___", padding: false)
<<255, 255>>
iex(59)> Base.url_encode64(<<255, 255>>)
"__8="

Hope this helps.

daidoji commented 2 months ago

Maybe we have already eliminated this ambiguity by disallowing in the spec any extra bits in the pad bytes at the end of a base64url-encoded value?

There should never be any padding by design. Everything has to align on the 3-bit and 4-bit boundary by definition.

dhh1128 commented 2 months ago

Thanks, @daidoji .

What I can't reconcile is the B64 RFC saying we must be byte-aligned, and the CESR spec saying we never pad and we must be 3- and 4-bit aligned. These two things seem to introduce an undefined condition because CESR also requires these primitives to be base64url encoded.

dhh1128 commented 2 months ago

So, reading your link to the Base64 RFC, the sentence that jumps out to me is: "These pad bits MUST be set to zero by conforming encoders". What this means, I think, is that GZ/DdmzryjQOtOdQauTm/YxggbVM7EWelk8IBxsnC+d= is not canonical base64 text, and GZ_DdmzryjQOtOdQauTm_YxggbVM7EWelk8IBxsnC-d is not canonical base64url text. Therefore the CESR parser should reject it. If this is the intent, then there is only one valid way to encode a primitive, there are no synonyms, and a naive string comparison of identifiers is safe. However, there are some possible text sequences that do not produce valid identifiers, so we can't test for validity of identifiers just by regex.

SmithSamuelM commented 2 months ago

I believe most B64 decoders just drop the pad bits and pad chars which means if someone were to violate the spec MBZ requirement by not setting the pad bits to 0s it would still decode. This can't happen so easily in CESR since it Mid pads not tail pads to ensure 24 bit alignment. also a compliant CESR decoder will raise an exception if the mid pad is not zeroed. In CESR codes are left aligned and values are right aligned and any pad bits go in the middle. This makes codes stable ie not corrupted by the attached value and values stable ie not corrupted by the preceding code. It would be a hugely bad idea from a security perspective for any implementation to not enforce zeros in the mid pad bits this could leave that implementation vulnerable to a malleability attack

kentbull commented 2 months ago

I had the benefit of Sam responding while I had this answer mostly prepared 😄 His answer is much shorter than mine. If you want a bit more long and drawn out answer, here you go.

I will attempt adding some illustration here. In short, there are no synonyms because of how pre-encoding bit padding to 24 bit boundaries works with both CESR and Base64 URL Safe. For the rest of this comment Base64 means Base64URLSafe. The core reason why your initial statement

these AIDs reference the same key because when the base64url transformation back to raw bytes is performed, the difference between the final 2 characters drops out as irrelevant padding:

is incorrect is because the final characters 'c' and 'd' share bits with the last byte in the cryptographic primitive adjacent to the last character in the stream next to the padding, either 'c' or 'd' in this case. The visuals I include below make this clear. Bit sharing between the last character adjacent to padding always occurs for cryptographic primitives that require padding to align on 24 bit boundaries. So ignoring these last characters will always result in different SAIDs because you are ignoring some of the pad bits encoded in the characters adjacent to pad characters.

What can get confusing is that CESR pad characters occur at the start of the string and CESR replaces those pad characters with the derivation code, which is B in this case, the derivation code for an Ed25519 non-transferable identifier.

Wrapping up the bit sharing, as 'c' is 011100 in binary and 'd' is 011101 then that last 1 in the d is not padding and is instead a part of the cryptographic primitive you are translating, in this case the SAID of an Ed25519 non-transferable identifier.

Padding

CESR does pre-padding (on the left/start), Base64 does post-padding (on the right/end). Both do their padding, or are supposed to do their padding, with the raw bits they are sent before performing the encoding process where 6-bit groups of raw bytes are encoded in a Base64 character. See section 9.3 - conversions where Sam states that

[t]his means that the length of any Primitive in the ‘B’ domain MUST be an integer multiple of three binary bytes with a minimum length of three binary bytes.

The only way to get all primitives to a length that is an integer multiple of three binary bytes in the binary ('B') domain prior to encoding them as Base64 is to do pre-padding. Why do this pre-padding? To ensure the total length of the primitive always aligns on a 24 bit boundary so that, as the CESR spec says,

the conversion of concatenated Primitives in one Domain never results in the same byte or character in the converted Domain sharing bits from two adjacent Primitives

which means that two adjacent primitives never share bits.

From the Base64 spec in section 4 we see pad bits are added on the right in Base64 encoding.

When fewer than 24 input bits are available in an input group, bits with value zero are added (on the right) to form an integral number of 6-bit groups.

The problem with Base64 encoding alone of a concatenated string of cryptographic primitives is that there are no clear boundaries on where one cryptographic primitive starts and ends without keeping additional parsing context somewhere, which would overly complicate parsing instructions. This is why I believe Sam calls such Base64 encoding "naive Base64."

To get into the "why" a bit around 24 bit boundaries, due to pre-padding to align on 24-bit boundaries CESR parser can largely act like a TLV (type, length, value) parser, along with some contextual rules found in KERIpy, which allows in-stride stripping of primitive-length bytes from a stream without having to parse the entire primitive, enabling pipelining. TLV parsers are much easier to write and as mentioned provide the capability for offloading primitives to different CPU cores through pipelining.

One reason why aligning on 24 bit boundaries is so important, as Sam mentioned just now, is that some B64 decoders may still decode a string even if the pad characters are not correctly appended, meaning someone violated the spec.

most B64 decoders just drop the pad bits and pad chars which means if someone were to violate the spec MBZ requirement by not setting the pad bits to 0s it would still decode.

A Base 64 decoder may still decode a value even without the pad bits, meaning you would get a different value than you encoded, because the bits would be interpreted differently than the would be with proper padding present.

Visualizing the difference

See the below bit diagrams for clarity on how CESR and Base64 do bit padding in bytes prior to encoding raw bytes into Base64 characters.

CESR Pre-Padded Encoded Digest Diagram

As you can see the bit padding occurs on the front as a single pad byte (in the B-binary domain) and character C43, or P (in the T-text domain) shares some raw bits with the encoded primitive while also including two bad bits of the single pad byte that was prepended by CESR. For the calculation of how many pad bytes to include prior to encoding see the qb64b function in the Typescript implementation of SAIDification I wrote or Matter._infil in KERIpy.

In CESR the pad character C44 would be encoded as an "A" which is equivalent to six zero bits in the Base64 alphabet. CESR then replaces this C44 character with the derivation code B for the Ed25519 non-transferable identifier.

Naive Base64 Post-Padded Encoded Digest Diagram

Now in Base64 the bit padding occurs at the end which means that the bit sharing occurs in character C2 adjacent to the pad character C1. In naive Base64 C1 would be an all zero bit character, or an equals sign, which is Base64's padding character.

Inline responses

these AIDs reference the same key because when the base64url transformation back to raw bytes is performed, the difference between the final 2 characters drops out as irrelevant padding:

The reason this is incorrect is due to bit sharing. A simple Base64url transformation expects post-padding, which is not what CESR creates because CESR does pre-padding. Yet, regardless of pre or post padding you can't drop the 'c' and 'd' characters because they have pad bits mixed with raw bits, what I am calling bit sharing.

You must pre-pad with CESR prior to doing the Base64URL encoding in order to get a CESR-compatible value.

One consequence I can see is that naive comparisons of AIDs as strings (which would say A != B) are wrong

No, you should always be able to directly compare AIDs as strings as long as they are properly encoded.

a user could construct an AID or hash that looks different and fools a naive comparison algorithm, but is the same in a sophisticated algorithm

Again, a properly encoded AID will always be the same, so no fooling of algorithms. The reason the value stays stable as Sam mentioned above is the alignment on 24 bit boundaries which gives you a stable value that always encodes to the same set of Base64 characters. There actually are pad characters and they are essential to the Base64URLSafe decoding process to get from Base64 characters back to raw primitive bytes.

There should never be any padding by design.

This is not quite correct. You probably meant "There should never be any padding characters by design" because CESR replaces pad characters with derivation codes.

What I can't reconcile is the B64 RFC saying we must be byte-aligned, and the CESR spec saying we never pad and we must be 3- and 4-bit aligned.

The CESR spec says you must pad with a pre-pad pre-conversion, what Sam labeled a "mid-pad." See CESR section 10.1 and CESR section 10.2 for a comprehensive description of this.

kentbull commented 2 months ago

So my comment earlier is slightly inaccurate regarding Base64 and pre-padding. Base64 does not do pre-padding prior to conversion. It instead adds padding as necessary to the final 6 bit group that includes bits from the raw bytes being encoded. There is not a concept of pad bytes in Base64, only a concept of pad bits and pad characters. The pad bits are only present in the last base64 character adjacent to the padding characters, when there are pad characters.

The number of pad characters '=' tells the Base64 decoder how many pad bits from the character adjacent to the pad bit to remove when converting the Base64 encoded value back to the original raw value.

I clarified everything from our conversation here regarding SAIDs into my latest blog post including fixed diagrams. If anything regarding SAIDs remains unclear after reading that post then let me know and I'll update the post.

dhh1128 commented 2 months ago

Thank you for the good info, @daidoji , @kentbull , and @SmithSamuelM . I will close the ticket so it is not cluttering our to-do lists.

WebOfTrust / keripy