This issue was based on external input:
That UTF-8 is the only reasonable external representation of text is clear.
The JCS specification (at the time of writing) sorts properties by UTF-16 code units. This is indeed partly to maintain optimal performance on legacy software platforms, but it is also due to JSON itself, which specifies Unicode string escapes as UTF-16 constants. If sorting on UTF-16 actually were a problem, wouldn't it affect JSON as well?
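To illustrate the point about escapes: in JSON, a character outside the Basic Multilingual Plane can only be written as a `\u` escape by spelling out its UTF-16 surrogate pair. A minimal sketch using Go's standard `encoding/json` package (the helper name `decodeJSONString` and the sample character are made up for this example):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONString parses a JSON string literal and returns its Go value.
// Hypothetical helper, for illustration only.
func decodeJSONString(literal string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(literal), &s)
	return s, err
}

func main() {
	// U+1F600 is written in JSON as the UTF-16 surrogate pair D83D DE00.
	s, err := decodeJSONString(`"\ud83d\ude00"`)
	if err != nil {
		panic(err)
	}
	// The parser joins the two escapes into a single code point.
	for _, r := range s {
		fmt.Printf("%U\n", r)
	}
}
```

So a JSON parser already has to understand UTF-16 surrogates, whatever encoding the document itself is transported in.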
Just to make things more complicated, other JSON canonicalization schemes prescribe sorting by Unicode code points instead: https://tools.ietf.org/html/rfc7638 https://gibson042.github.io/canonicaljson-spec/
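The two orderings only disagree when, at the first differing position, one string has a character in the range U+E000 to U+FFFF and the other has a character above U+FFFF. A rough sketch of the difference in Go (the helper `lessUTF16` and the two sample characters are chosen for this example and are not taken from any of the specifications):

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// lessUTF16 reports whether a sorts before b when the strings are
// compared as sequences of UTF-16 code units. Hypothetical helper.
func lessUTF16(a, b string) bool {
	ua, ub := utf16.Encode([]rune(a)), utf16.Encode([]rune(b))
	for i := 0; i < len(ua) && i < len(ub); i++ {
		if ua[i] != ub[i] {
			return ua[i] < ub[i]
		}
	}
	return len(ua) < len(ub)
}

func main() {
	a := "\U0001F600" // UTF-16: D83D DE00
	b := "\uFB33"     // UTF-16: FB33

	// Go compares strings byte by byte; for valid UTF-8 this gives
	// the same result as comparing code points.
	fmt.Println("code point order, a before b:", a < b)
	fmt.Println("UTF-16 order,     a before b:", lessUTF16(a, b))
}
```

For property names made up of characters below U+E000, which covers the vast majority of real-world keys, both schemes produce the same order.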
See: https://github.com/cyberphone/json-canonicalization/issues/5
I am not in any way married to using UTF-16 as the foundation for sorting, but since sorting is an internal operation, I do not see changing this as a necessity for interoperability.
https://unicode.org/faq/utf_bom.html states the following regarding the different UTF variants: "The conversions between all of them are algorithmically based, fast and lossless."
The Go implementation confirmed that UTF-16 is not a problem; only a single line was required to get the data into the proper format.
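The line in question is not reproduced here. As an illustration only, and not necessarily what the Go implementation does, a conversion of this kind can be written with the standard `unicode/utf16` package. The helper names `toUTF16` and `fromUTF16` are invented for this sketch:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 converts a Go string (UTF-8) into UTF-16 code units.
// Assumes valid UTF-8; invalid bytes would become U+FFFD.
func toUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

// fromUTF16 converts UTF-16 code units back into a Go string.
func fromUTF16(u []uint16) string {
	return string(utf16.Decode(u))
}

func main() {
	key := "\u20AC\U0001F600" // euro sign followed by an emoji
	units := toUTF16(key)

	// One code unit for U+20AC, two (a surrogate pair) for U+1F600.
	fmt.Println("code units:", len(units))
	// Converting back yields the original string, so nothing is lost.
	fmt.Println("round trip ok:", fromUTF16(units) == key)
}
```

The resulting `[]uint16` slices can serve as sort keys, while the serialized output stays in UTF-8.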