Enforce Canonical Base 64 encoding.

zamicol commented 1 year ago

Playground demonstrating the issue:

Using Coze: https://go.dev/play/p/l_dZ9q4DZAA
Pure Go (non-strict): https://go.dev/play/p/N1rZckATLOf
Pure Go (strict): https://go.dev/play/p/t2nXCD8VEmw // Correctly errors.

There's an apparent problem with RFC 4648. There are three places base 64 representation may contain string variation:

1. Padding
2. Alphabet (URI unsafe or URI safe)
3. Canonical encoding (various strings can decode to the same byte string, but there is only one canonical encoding)

What is "canonical encoding"? Take the last three characters of the example tmb, "cLj8vs...XNuhOk": the values hOk and hOl both decode to the same bytes (in Hex, 84E9) even though they are different strings, because the unused trailing bits of the final character are ignored when decoding. (Example decoding hOk and hOl.) The canonical encoding is hOk.
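The hOk/hOl example can be reproduced with Go's standard library. This is a minimal sketch, not Coze's code: `decodeBoth` is a hypothetical helper name, and it assumes the unpadded URL-safe alphabet (`base64.RawURLEncoding`).

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeBoth (hypothetical helper) decodes s twice: once with Go's default
// permissive decoder and once with the Strict() variant, which requires the
// unused trailing bits to be zero.
func decodeBoth(s string) (permissive []byte, strictErr error) {
	permissive, _ = base64.RawURLEncoding.DecodeString(s)
	_, strictErr = base64.RawURLEncoding.Strict().DecodeString(s)
	return permissive, strictErr
}

func main() {
	for _, s := range []string{"hOk", "hOl"} {
		b, err := decodeBoth(s)
		// Both inputs decode permissively to 84E9; only hOl fails strict decoding.
		fmt.Printf("%s -> %X (strict error: %v)\n", s, b, err)
	}
}
```

The final character carries six bits, but only four are needed to complete the second byte. `k` leaves the two spare bits as 00, while `l` sets them to 01, which the permissive decoder silently discards.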

The RFC specifically addresses 1 and 2, but not really 3.

RFC 4648 advises rejecting non-alphabet characters, which can include padding. I agree with this advice:

Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise. [...] Furthermore, such specifications MAY ignore the pad character, "=", treating it as non-alphabet data[.]

I don't see the RFC really addressing the third concern.

Behavior

Obviously non-"strict"/non-canonical base 64 encoding is incorrect, and any encoder producing non-strict encoding should be fixed. However, the question is: what should Coze specify regarding non-strict encoding/decoding? Both Go and JavaScript are permissive when decoding and do not throw errors.

Ultimately, the concern is that different base 64 encoders/decoders may have different behavior. Ideally, Coze should specify the appropriate behavior for Coze. Section 3.5 of RFC 4648 mentions non-canonical encoding in the context of unpadded data, but this issue is unrelated to padding (hOk= and hOl=, both padded, have the same issue as unpadded strings).

The concern is that if one Coze implementation used string comparison and another used byte comparison, implementations could disagree about which messages are valid. For example, if a Coze implementation checks tmb before cryptographic verification and the tmb string is non-strictly encoded, comparing the string value and comparing the decoded byte value produce different results.
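To make the disagreement concrete, here is a small Go sketch comparing an expected tmb suffix against a non-canonical variant both ways. `equalAsBytes` is a hypothetical helper, not part of any Coze library, and the three-character values stand in for a full tmb.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

// equalAsBytes (hypothetical helper) reports whether two b64ut strings decode
// to the same bytes using Go's default permissive decoder.
func equalAsBytes(a, b string) (bool, error) {
	x, err := base64.RawURLEncoding.DecodeString(a)
	if err != nil {
		return false, err
	}
	y, err := base64.RawURLEncoding.DecodeString(b)
	if err != nil {
		return false, err
	}
	return bytes.Equal(x, y), nil
}

func main() {
	expected, received := "hOk", "hOl"
	byBytes, err := equalAsBytes(expected, received)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("string comparison:", expected == received) // false
	fmt.Println("byte comparison:  ", byBytes)              // true
}
```

An implementation using the first comparison rejects the message, while one using the second accepts it.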

Another note for any Coze restriction on encoding: JSON is base 64 unaware, so any Coze-specified enforcement of base 64 encoding can only be applied to known Coze fields with type b64ut, and cannot be applied generally to every b64ut value.

Solutions

There appear to be only two options to handle this:

1. Be permissive on inbound encoding, force strict outbound encoding.
2. Force strict encoding and decoding. (This can only be done when type is known to be b64ut.)

Option 2 is more conservative, but may require unnecessary checks that don't really add value. Option 1 has the potential to be more compatible if we assume that systems can decode permissively (other programming languages' base 64 libraries decode permissively), which may be a bad assumption.

Regardless, I believe that 1 is the correct behavior here. Even if languages/systems do not error on non-canonical encoding, an encoding error can be implemented by re-encoding the decoded data and comparing strings.

Security Considerations

This base 64 decoding bug doesn't appear to be a structural/architectural/security concern since Coze uses the UTF-8 encoding of the string for signing and verification; however, it is an interesting problem that should be known when working with RFC base 64. Concerning replay attacks specifically, signatures are still not malleable, as payloads are UTF-8 encoded and the signing operation is not base 64 aware.

If Coze used the base 64 representation directly, this would be a security concern and could result in replay attacks.

Notes

It should be obvious, but this situation also applies to the URI unsafe alphabet and to messages with base 64 padding, all of which are interpreted as the same bytes. (My conversion tool only has "base64" as an input and not the various permutations, since the variation can be detected (or is irrelevant) and all variations result in the same decoded binary payload.)

RFC 4648

I currently have errata open on one of the relevant sections.

I'm going to implement a non-canonical encoding check on Go and JS Coze.
