cognitect / transit-format

A data interchange format.

Please clarify String byte-encoding format #18

Open timmc opened 10 years ago

timmc commented 10 years ago

Are strings encoded in UTF-8, UTF-16, UCS-2, or what? For example, how would the String value "𐀀" (U+10000) be encoded in the various formats? (That is, what character encoding is used for Transit when encoding JSON to bytes?)
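To make the question concrete, here is a minimal Python sketch of how U+10000 comes out under a few candidate encodings. It illustrates the encodings themselves and makes no claim about what Transit does; the variable names are only for illustration.

```python
# U+10000 is the first code point outside the Basic Multilingual Plane.
s = "\U00010000"

utf8 = s.encode("utf-8")       # four bytes, no surrogates involved
utf16 = s.encode("utf-16-be")  # one surrogate pair: D800 DC00
utf32 = s.encode("utf-32-be")  # the code point as a single 32-bit unit

print(utf8.hex(" "))   # f0 90 80 80
print(utf16.hex(" "))  # d8 00 dc 00
print(utf32.hex(" "))  # 00 01 00 00
```

UCS-2 has no representation for this character at all, since it only covers the BMP.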

(Edit: Removed further question about illegal bytes in raw-string inputs in various languages.)

(Sorry for the thrashing -- I'm reopening this now that I see that yes, GitHub Issues are in fact being used for this project.)

jlouis commented 10 years ago

The question is actually: "Is there a default encoding of the document, or is this negotiable?" In practice, is it fixed as UTF-8 (in which case U+10000 and other code points outside the BMP are encoded as four bytes, with surrogate pairs needed only in \u escapes), or is it something else?

timmc commented 10 years ago

There are actually two questions, one of which I think is already answered:

  1. When a String value's characters are expressed in JSON, what encoding is used? (Answer from spec: No encoding needed for non-ASCII, but UTF-16 is to be used for any Unicode character escapes.)
  2. When the JSON is then written to a byte-oriented medium, what encoding is used for the character->bytes conversion?
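The two layers can be seen separately in a short Python sketch using the standard `json` module. This only illustrates the distinction; it is not how any Transit implementation is known to behave, and the choice of UTF-8 in the last step is an assumption, which is exactly what question 2 asks about.

```python
import json

s = "\U00010000"

# Layer 1: characters -> JSON text. With escaping, a non-BMP character
# becomes a UTF-16 surrogate pair written as two \u escapes.
escaped = json.dumps(s)                    # ensure_ascii=True is the default
raw = json.dumps(s, ensure_ascii=False)    # character left as-is

print(escaped)  # "\ud800\udc00"

# Layer 2: JSON text -> bytes. The escaped form is pure ASCII, so the byte
# encoding barely matters; the raw form depends entirely on it.
escaped_bytes = escaped.encode("ascii")
raw_bytes = raw.encode("utf-8")            # assumed encoding, see question 2

# Both forms decode back to the same string.
assert json.loads(escaped_bytes.decode("ascii")) == s
assert json.loads(raw_bytes.decode("utf-8")) == s
```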

The JSON spec suggests using UTF-8, but it doesn't demand it. I think it would be appropriate for Transit to lock this down so that we don't get nasty character encoding issues between platforms with different system defaults (e.g. Windows-1252 on Windows with English locales).
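As a rough illustration of the kind of mismatch I mean, here is a Python sketch in which a writer emits UTF-8 and a reader assumes Windows-1252. The sample payload is made up for the example.

```python
# Hypothetical JSON payload containing one non-ASCII character.
text = '["café"]'

# Writer side: bytes produced with UTF-8.
wire = text.encode("utf-8")

# Reader side: bytes decoded with a different platform default.
wrong = wire.decode("cp1252")
right = wire.decode("utf-8")

print(wrong)  # ["cafÃ©"]
print(right)  # ["café"]
```

Neither side raises an error here, so the corruption passes silently unless both ends agree on the encoding.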

(As for MessagePack, a quick glance suggests it already specifies an encoding of UTF-8.)