Open timmc opened 10 years ago
The question is actually: "is there a default encoding of the document or is this negotiable?" In practice, is it nailed to be UTF-8 (in which case U+10000 and on, outside the BMP, take four bytes each on the wire, and surrogate pairs are only needed if you write them as \uXXXX escapes), or is it something else?
There are actually two questions, one of which I think is already answered:
The JSON spec suggests using UTF-8, but it doesn't demand it. I think it would be appropriate for Transit to lock this down so that we don't get nasty character encoding issues between platforms with different system defaults (e.g. Windows-1252 on Windows with English locales).
(As for MessagePack, a quick glance suggests it already specifies an encoding of UTF-8.)
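To illustrate the kind of cross-platform issue meant here, a minimal Python sketch (not Transit code; the payload and the variable names are made up for the example). It writes JSON as UTF-8 bytes and then decodes those bytes the way a reader defaulting to Windows-1252 would:

```python
import json

# Hypothetical payload with one non-ASCII character.
payload = json.dumps({"name": "café"}, ensure_ascii=False)

# The writer picks UTF-8 explicitly.
wire = payload.encode("utf-8")

# A reader that falls back to a Windows-1252 system default
# decodes the same bytes into different characters.
wrong = json.loads(wire.decode("cp1252"))["name"]

# A reader that agrees on UTF-8 gets the original string back.
right = json.loads(wire.decode("utf-8"))["name"]

print(wrong)  # cafÃ©
print(right)  # café
```

Both readers parse valid JSON, so nothing fails loudly; the data is just silently wrong. That is the argument for the spec pinning the encoding rather than leaving it to platform defaults.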
Are strings encoded in UTF-8, UTF-16, UCS-2, or what? For example, how would the String value
"𐀀"
(U+10000) be encoded in the various formats? (That is, what character encoding does Transit use when encoding JSON to bytes?) (Edit: Removed a further question about illegal bytes in raw-string inputs in various languages.)
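For concreteness, here is what the candidate answers look like for U+10000, sketched with Python's standard library only. This shows the encodings themselves, not what any Transit implementation actually emits; the variable names are just for the example.

```python
import json

s = "\U00010000"  # the single code point U+10000

# UTF-8: one four-byte sequence, no surrogates involved.
utf8 = s.encode("utf-8")

# UTF-16 (big-endian, no BOM): a surrogate pair, two 16-bit units.
utf16 = s.encode("utf-16-be")

# JSON with ASCII-only output: the surrogate pair written as \u escapes.
escaped = json.dumps(s)

# JSON with the character written literally, then encoded as UTF-8.
literal = json.dumps(s, ensure_ascii=False).encode("utf-8")

print(utf8.hex())     # f0908080
print(utf16.hex())    # d800dc00
print(escaped)        # "\ud800\udc00"
print(literal.hex())  # 22f090808022
```

Note that the escaped form is pure ASCII, so it reads the same under UTF-8, Windows-1252, or any other ASCII-compatible encoding, while the literal form depends on both sides agreeing on UTF-8.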
(Sorry for the thrashing -- I'm reopening this now that I see that yes, Github Issues are in fact being used for this project.)