cyberphone / json-canonicalization

JSON Canonicalization Scheme (JCS)

Sorting should be based on UTF-8 values #6

Closed cyberphone closed 5 years ago

cyberphone commented 5 years ago

This issue was based on external input:

That UTF-8 is the only reasonable external representation of text is clear.

The JCS specification (at the time of writing) sorts properties by UTF-16 code units. This is partly to maintain optimal performance on legacy software platforms, but also because JSON itself specifies Unicode string escapes as UTF-16 constants. If sorting on UTF-16 were actually a problem, wouldn't it affect JSON as well?
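To illustrate the point about JSON escapes, here is a minimal sketch showing that a character outside the Basic Multilingual Plane is written in JSON as a pair of UTF-16 surrogate escapes. The helper name `decodeJSONString` is hypothetical and exists only for this example; it relies on Go's standard `encoding/json` package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONString (hypothetical helper) parses a JSON string literal,
// including any \uXXXX escapes, into a Go string.
func decodeJSONString(literal string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(literal), &s)
	return s, err
}

func main() {
	// U+1F600 has no single \uXXXX escape in JSON; it is written as the
	// UTF-16 surrogate pair D83D DE00.
	s, err := decodeJSONString(`"\ud83d\ude00"`)
	if err != nil {
		panic(err)
	}
	// The decoded string holds one code point, not two.
	fmt.Println(len([]rune(s)), s == "\U0001F600")
}
```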

To complicate matters, other JSON canonicalization schemes prescribe sorting by Unicode code points instead: https://tools.ietf.org/html/rfc7638 https://gibson042.github.io/canonicaljson-spec/

See: https://github.com/cyberphone/json-canonicalization/issues/5

I am not in any way married to using UTF-16 as the foundation for sorting, but since sorting is an internal operation, I do not see changing it as necessary for interoperability.

https://unicode.org/faq/utf_bom.html states the following regarding the different UTF variants: "The conversions between all of them are algorithmically based, fast and lossless."

The Go implementation confirmed that UTF-16 is not a problem; only a single line was required to get the data into the proper format:

    sortKey := utf16.Encode([]rune(rawUTF8))
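For context, the sketch below shows where the two orderings actually diverge. It is not the JCS reference code: the function names `lessUTF16` and `sortByUTF16` are hypothetical, and the two sample keys were picked for illustration. It uses the same `utf16.Encode([]rune(...))` conversion as the line above, and compares the result with Go's built-in byte-wise string ordering, which for valid UTF-8 matches code point order.

```go
package main

import (
	"fmt"
	"sort"
	"unicode/utf16"
)

// lessUTF16 (hypothetical helper) reports whether a sorts before b when
// both strings are compared as sequences of UTF-16 code units.
func lessUTF16(a, b string) bool {
	ua := utf16.Encode([]rune(a))
	ub := utf16.Encode([]rune(b))
	for i := 0; i < len(ua) && i < len(ub); i++ {
		if ua[i] != ub[i] {
			return ua[i] < ub[i]
		}
	}
	// One key is a prefix of the other: the shorter one sorts first.
	return len(ua) < len(ub)
}

// sortByUTF16 sorts keys in place by UTF-16 code unit order.
func sortByUTF16(keys []string) {
	sort.Slice(keys, func(i, j int) bool { return lessUTF16(keys[i], keys[j]) })
}

func main() {
	// U+FB33 is a single UTF-16 unit (FB33); U+1F600 is the surrogate
	// pair D83D DE00, whose first unit is smaller than FB33.
	byUTF16 := []string{"\uFB33", "\U0001F600"}
	byUTF8 := []string{"\uFB33", "\U0001F600"}

	sortByUTF16(byUTF16) // U+1F600 sorts first
	sort.Strings(byUTF8) // byte-wise order: U+FB33 sorts first

	fmt.Printf("UTF-16 order: %+q\n", byUTF16)
	fmt.Printf("UTF-8 order:  %+q\n", byUTF8)
}
```

The orderings only differ when a key contains a code point in the range U+E000 to U+FFFF alongside one above U+FFFF at the first point of difference, so most real-world property names sort identically under either rule.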