Base64 encoding can only elide padding when the size of encoded data is known

peterbourgon commented 1 year ago

https://github.com/Cyphrme/Coze/blob/01c154e4024b4e876b8d152166ce85cf2a945e22/README.md#coze-fields

Binary values are encoded as RFC 4648 base64 URI with padding truncated (b64ut).

https://www.rfc-editor.org/rfc/rfc4648#section-3.2

when assumptions about the size of transported data cannot be made, padding is required to yield correct decoded data.

As far as I can tell, the size of binary values is not communicated to recipients, and therefore padding should not be truncated. (The URI encoding is also non-standard.)

zamicol commented 1 year ago

How is the URI encoding non-standard?

There are some general things about RFC base64 that should be said first: Padding characters help satisfy length requirements and carry no other meaning. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence. Since Coze does not concatenate unpadded base64 strings, Coze does not need base64 padding. Coze only concatenates the binary form, not base64.

For RFC base64 /w== and /w are equal to 11111111. The padding doesn't do anything. Also, omitting padding saves precious message space.
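
To make the `/w==` versus `/w` claim concrete, here is a minimal Go sketch. The helper name `decodeBoth` is made up for illustration; it decodes the padded form with `base64.StdEncoding` and the unpadded form with `base64.RawStdEncoding`, and both should yield the single byte `0xff`.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeBoth (hypothetical helper) decodes a padded and an unpadded
// standard-alphabet base64 string and returns both results.
func decodeBoth(padded, unpadded string) ([]byte, []byte, error) {
	p, err := base64.StdEncoding.DecodeString(padded)
	if err != nil {
		return nil, nil, err
	}
	u, err := base64.RawStdEncoding.DecodeString(unpadded)
	if err != nil {
		return nil, nil, err
	}
	return p, u, nil
}

func main() {
	p, u, err := decodeBoth("/w==", "/w")
	if err != nil {
		panic(err)
	}
	// Both decode to the single byte 0xff (binary 11111111).
	fmt.Printf("%x %x\n", p, u)
}
```

Note that Go keeps the two forms strictly separate: `StdEncoding` rejects `/w` and `RawStdEncoding` rejects `/w==`, so the decoder has to be told which convention is in use.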

To summarize, there are a few reasons why padding is not needed:

1. The purpose of padding is to explicitly denote "empty bytes". It's already redundant.
2. It's in JSON.
3. Digests and cryptographic signatures.

Expanding on each:

1. Padding is never needed to correctly decode a base64 message as long as the whole message was transported.
2. Coze also doesn't have to worry about base64 concatenation since base64 is in JSON. The double quote serves as the base64 string terminator.
3. If something went wrong with encoding or transport, Coze verification simply fails, since the cryptographic functions also serve as integrity checks. As long as a Coze message verifies, it's also integral.
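
The concatenation point can be sketched in Go as follows. This is only an illustration under stated assumptions: `decodeRaw` is a made-up helper, and the JSON array is an arbitrary stand-in for a message, not a Coze structure. It shows that joining two unpadded strings loses the boundary between them, while JSON's quotes preserve it.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// decodeRaw (hypothetical helper) decodes an unpadded standard-alphabet
// base64 string and panics on malformed input.
func decodeRaw(s string) []byte {
	b, err := base64.RawStdEncoding.DecodeString(s)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// Concatenating two unpadded strings loses the boundary between them:
	// "/w/w" is read as one 4-character group.
	fmt.Printf("%x\n", decodeRaw("/w"+"/w")) // ff0ff0, not ffff

	// Inside JSON, the double quotes delimit each value, so each string
	// is decoded on its own.
	var vals []string
	if err := json.Unmarshal([]byte(`["/w","/w"]`), &vals); err != nil {
		panic(err)
	}
	for _, v := range vals {
		fmt.Printf("%x\n", decodeRaw(v)) // ff
	}
}
```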

As an aside, JOSE already does the same thing, and it's widely used in industry.

As a historical anecdote, Coze used to encode with Hex because it is more human readable, is always exactly twice as large as the binary form, and doesn't need padding. On the other hand, b64ut does not have a static multiplicative relationship with binary, is less human readable, and for a few edge cases padding can be useful. After considering the message size savings, we dropped Hex in favor of RFC b64ut.
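
The size trade-off can be checked with a short Go sketch. The helper name `encodedLens` is invented for illustration, and `base64.RawURLEncoding` is assumed here as the Go equivalent of b64ut (URL alphabet, no padding).

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// encodedLens (hypothetical helper) returns the encoded length of n bytes
// in Hex and in unpadded base64 with the URL alphabet.
func encodedLens(n int) (hexLen, b64utLen int) {
	return hex.EncodedLen(n), base64.RawURLEncoding.EncodedLen(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 32, 64} {
		h, b := encodedLens(n)
		fmt.Printf("%2d bytes: hex=%3d b64ut=%2d\n", n, h, b)
	}
}
```

For a 32-byte digest this gives 64 Hex characters against 43 b64ut characters. Hex is always exactly 2n, while the b64ut length is 4n/3 rounded up.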

Satoshi had the same concerns with base64 and thus used base 58. We decided that RFC base64 is good enough and that implementing an alternative base conversion system would be more trouble than it's worth. That has not stopped others from doing so. (Keybase's solution is linked, along with others, at the bottom of the base conversion tool.) If we had chosen a different base conversion method, I would have liked to use a higher base (like a base 91 alphabet), which results in shorter messages. However, character escaping then becomes an issue. At that point, a purely binary form of Coze would be better. Base64 is "right sized": it has enough characters to make messages reasonably short, while not having so many that it requires an excessive amount of escaping for various applications.

zamicol commented 1 year ago

I don't mean "close the issue" for no more feedback, but I don't believe this is a concern. (I'm a bit of a GitHub dunce, please forgive any of my social blunders done by clicking green buttons.)

I appreciate you reading Coze and poking holes into it. This is exactly what needs to be done, and I want to motivate skepticism as much as I can.

peterbourgon commented 1 year ago

How is the URI encoding non-standard?

You use base64.URLEncoding. That is described as

URLEncoding is the alternate base64 encoding defined in RFC 4648. It is typically used in URLs and file names.

whereas base64.StdEncoding is described as

StdEncoding is the standard base64 encoding, as defined in RFC 4648.
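
For reference, the two Go encodings differ only in the characters used for values 62 and 63. A small sketch (the helper name `encodeBoth` is invented for illustration):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeBoth (hypothetical helper) encodes the same bytes with the standard
// alphabet and with the URL-safe alphabet, both with padding.
func encodeBoth(b []byte) (std, urlSafe string) {
	return base64.StdEncoding.EncodeToString(b), base64.URLEncoding.EncodeToString(b)
}

func main() {
	std, urlSafe := encodeBoth([]byte{0xfb, 0xff})
	fmt.Println(std)     // +/8=
	fmt.Println(urlSafe) // -_8=
}
```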

Padding characters help satisfy length requirements and carry no other meaning. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.

As far as I can tell, this is a mis-reading of the relevant requirements, and not correct. It is only possible to unambiguously decode a base64 encoded string in isolation if padding characters are included. If padding characters are elided, then it is only possible to unambiguously decode that string if the length is communicated out-of-band.

Quoting https://www.rfc-editor.org/rfc/rfc4648#section-3.2

   In some circumstances, the use of padding ("=") in base-encoded data
   is not required or used.  In the general case, when assumptions about
   the size of transported data cannot be made, padding is required to
   yield correct decoded data.

   Implementations MUST include appropriate pad characters at the end of
   encoded data unless the specification referring to this document
   explicitly states otherwise.

zamicol commented 1 year ago

base64.StdEncoding

It's a matter of semantics. base64.URLEncoding is standardized by the same RFC. We've dubbed it more specifically b64ut. Even though it's not what the RFC names as the standard alphabet, URI encoding is standardized formally by that RFC 4648.

We especially felt the need to dub it b64ut to avoid confusion with the generalized arbitrary base 64 which uses the "iterative divide by radix" method and is sometimes equal to RFC base64.

The JOSE JWS RFC specifically addresses that:

As per the example code above, the number of '=' padding characters that needs to be added to the end of a base64url-encoded string without padding to turn it into one with padding is a deterministic function of the length of the encoded string. Specifically, if the length mod 4 is 0, no padding is added; if the length mod 4 is 2, two '=' padding characters are added; if the length mod 4 is 3, one '=' padding character is added; if the length mod 4 is 1, the input is malformed.

And that is correct. Padding is always deterministically recreatable as long as the original message is given.
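
The length-mod-4 rule quoted from the JWS RFC can be sketched in Go like this. It is a minimal illustration, not the code from Appendix C; the function name `repad` is invented here.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// repad (hypothetical helper) restores '=' padding on an unpadded base64url
// string using only the length of the encoded string.
func repad(s string) (string, error) {
	switch len(s) % 4 {
	case 0:
		return s, nil
	case 2:
		return s + "==", nil
	case 3:
		return s + "=", nil
	default:
		// A remainder of 1 cannot be produced by any input.
		return "", errors.New("malformed base64url: length mod 4 is 1")
	}
}

func main() {
	for _, s := range []string{"_w", "-_8", "AAAA", "A"} {
		p, err := repad(s)
		if err != nil {
			fmt.Printf("%q: %v\n", s, err)
			continue
		}
		// The re-padded string decodes with the padded URL encoding.
		b, err := base64.URLEncoding.DecodeString(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> %q -> %x\n", s, p, b)
	}
}
```

This only works when the recipient has the complete encoded string, which is the condition the two sides of this thread disagree about.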

RFC 4648 is basically referring to streaming, where the original message may not be given, and I think it's one of the more confusingly worded sections. If a stream ends mid stream, without padding it may not be known that the stream ended or if there's an error. For batch processing, this isn't relevant. Firstly, the transport (TCP) will most likely error, then JSON itself will be malformed, the digest will be bad, and the cryptographic signature will not be valid. There are many layers of defense against having to worry about padding in Coze.
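
A rough Go sketch of two of those layers follows. Everything specific in it is an assumption made for illustration: the `dig` field name, the choice of SHA-256, and the helper `checkDigest` are not taken from the Coze spec. It shows a message cut mid-string failing JSON parsing, and a well-formed but shortened value failing a digest length check, with no padding involved.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// checkDigest (hypothetical helper) parses a JSON object with an assumed
// "dig" field holding an unpadded base64url SHA-256 digest and returns the
// first check that fails.
func checkDigest(msg []byte) error {
	var v struct {
		Dig string `json:"dig"`
	}
	if err := json.Unmarshal(msg, &v); err != nil {
		return fmt.Errorf("json: %w", err)
	}
	b, err := base64.RawURLEncoding.DecodeString(v.Dig)
	if err != nil {
		return fmt.Errorf("base64: %w", err)
	}
	if len(b) != sha256.Size {
		return fmt.Errorf("digest is %d bytes, want %d", len(b), sha256.Size)
	}
	return nil
}

func main() {
	sum := sha256.Sum256([]byte("hello"))
	dig := base64.RawURLEncoding.EncodeToString(sum[:])

	good := []byte(`{"dig":"` + dig + `"}`)
	cut := good[:len(good)-5]                    // transport cut mid-string
	short := []byte(`{"dig":"` + dig[:40] + `"}`) // valid JSON, truncated value

	fmt.Println(checkDigest(good))  // <nil>
	fmt.Println(checkDigest(cut))   // JSON error
	fmt.Println(checkDigest(short)) // digest length error
}
```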

peterbourgon commented 1 year ago

It's a matter of semantics. base64.URLEncoding is standardized by the same RFC. We've dubbed it more specifically b64ut. Even though it's not what the RFC names as the standard alphabet, URI encoding is standardized formally by that RFC 4648.

"Standard" does not mean "any one of the alphabets defined by the authoritative RFC", it means "the specific alphabet denominated as Standard by the authoritative RFC", which is explicitly not the URI encoding.

The JOSE JWS RFC

Where is this RFC referenced?

Padding is always deterministically recreatable as long as the original message is given . . . RFC 4648 is basically referring to streaming, where the original message may not be given, and I think it's one of the more confusingly worded sections. If a stream ends mid stream, without padding it may not be known that the stream ended or if there's an error.

None of these claims are correct. Padding is not deterministically re-createable, because "the original message" is not guaranteed to be knowable by a given recipient. Neither does RFC 4648 apply only to "streaming" use cases.

I'll disengage at this point.

zamicol commented 1 year ago

The JOSE JWS RFC is RFC 7515, which also doesn't use padding. See in particular Appendix C.

is not guaranteed to be knowable by a given recipient

We wrote the Coze spec assuming that systems can calculate the length of the digests given in messages, but even if that capability is not present in a particular system, that system can still use JSON validation, digests, or cryptographic verification to ensure messages are well-formed. So even for systems that for some reason cannot calculate the length of the base 64 messages, padding is still not needed.

The only time padding cannot be deterministically reconstructed is if the base 64 payload is malformed or if the receiving end doesn't have the capability to calculate the length of the payload, which is a weird and easily solvable problem on modern systems. Perhaps there's a technical edge case where this is a concern for minimal hardware systems that are implementing Coze? If you have something in particular in mind, I'd like to know more about those technical constraints. Go Coze and JavaScript Coze have no issue implementing this constraint. If the length of the payload can be known, it's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
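
As a sketch of that last claim in Go: given a complete unpadded string, the decoded length follows from the encoded length alone. The helper name `decodedLen` is invented here; it wraps the standard library's `DecodedLen`.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodedLen (hypothetical helper) predicts the number of decoded bytes
// from the length of an unpadded base64url string.
func decodedLen(encoded string) int {
	return base64.RawURLEncoding.DecodedLen(len(encoded))
}

func main() {
	for _, s := range []string{"_w", "-_8", "AAAA"} {
		b, err := base64.RawURLEncoding.DecodeString(s)
		if err != nil {
			panic(err)
		}
		// The predicted and the actual decoded lengths agree.
		fmt.Printf("%q: predicted %d, decoded %d bytes\n", s, decodedLen(s), len(b))
	}
}
```

The prediction assumes the string arrived whole, so it says nothing about a value that was cut short in transit.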

A good argument is made by Appendix C. It appears to me that there's no need to be concerned about padding since the code needed to reconstruct padding, if needed, is minimal and straightforward.

zamicol commented 1 year ago

I opened a new issue that's related to base 64 encoding: https://github.com/Cyphrme/Coze/issues/18
