cognitect / transit-format

A data interchange format.

Ambiguity on what a character is (regardless of encoding, in two usages) #31

Open cigitia opened 9 years ago

cigitia commented 9 years ago

The specification refers to “characters” in two separate places:

But there are many definitions of “character”, so both of these usages are ambiguous.

Usage 1

Usage 1 has the following ambiguities/issues:

So, question 1 is: What values is the c scalar Transit type allowed to contain (by its read and write handlers)? I see at least three options:

I personally anticipate option B1 to be chosen, since it's what Clojure itself does and takes the least work, but I'm still throwing in options A and B2 in the hope that they too would be considered. I now prefer that chars be clearly equated to UTF-16 code units, after reading the Unicode FAQs, which discuss preferring UTF-16 code units for low-level indexing and strings for everything else. Any option would create more work for someone; the question is which one is most worth it, and the specification should probably clarify this matter in any case.

Question 2: If the answer for question 1 is “16-bit, BMP code points only / no SMP characters allowed in Transit c values”, then should Transit writers (in those languages that support SMP characters) ensure that no Supplementary code points are ever written into Transit data as Transit c values?
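To make question 2 concrete, here is a minimal sketch of what writer-side enforcement could look like. The function `write_char` and its `bmp_only` flag are hypothetical names for illustration, not part of any Transit library; the flag simply switches between the "16-bit, BMP only" reading and the "any single code point" reading of a c value.

```python
def utf16_code_units(s: str) -> int:
    """Count the UTF-16 code units needed for s (2 per supplementary code point)."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


def write_char(ch: str, bmp_only: bool = True) -> str:
    """Hypothetical write handler: return the tagged string for a char value.

    bmp_only=True models the "16-bit, BMP only" answer to question 1;
    bmp_only=False models "any single Unicode code point".
    """
    if len(ch) != 1:
        raise ValueError("a char value must be exactly one code point")
    if bmp_only and utf16_code_units(ch) != 1:
        raise ValueError(f"U+{ord(ch):04X} needs a surrogate pair in UTF-16")
    return "~c" + ch
```

Under the BMP-only reading, `write_char("\U0001F4DE")` raises instead of emitting `"~c📞"`; with `bmp_only=False` the same value is written unchanged. Whether writers should behave this way is exactly what the question leaves open.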

Usage 2

For usage 2, there are multiple questions to be clarified:

Question 3: Are only 16-bit/BMP code points or any Unicode code point allowed to be used as scalar-type tags?

Question 4: If a single SMP character is used as a type tag, is it a scalar tag (because it is a single Unicode code point) or is it a composite tag (because it is two 16-bit surrogate code units)? (This is essentially equivalent to question 2.)

Question 5: Are whitespace characters allowed in type tags?

Question 6: Are control characters allowed in type tags?

Question 7: Are noncharacter code points (such as U+FDD0 and U+FFFE) allowed in type tags?

Question 8: If the answer to question 3 is “only 16-bit/BMP code points for scalar tags”, or the answer to any of questions 5–7 is “no”, then should writers ensure that the prohibited code points are never used?
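The categories named in questions 3 and 5–7 can be checked mechanically. The sketch below is only an illustration of what such a check might look like; `tag_problems` is a hypothetical helper, and which of these categories (if any) should actually be rejected is the open question, not something the specification states.

```python
import unicodedata
from typing import List


def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters: U+FDD0..U+FDEF plus the
    last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def tag_problems(tag: str) -> List[str]:
    """List the question-3/5/6/7 categories that the code points of tag fall into."""
    problems = []
    for ch in tag:
        cp = ord(ch)
        if cp > 0xFFFF:
            problems.append("non-BMP")        # question 3
        if ch.isspace():
            problems.append("whitespace")     # question 5
        if unicodedata.category(ch) == "Cc":
            problems.append("control")        # question 6
        if is_noncharacter(cp):
            problems.append("noncharacter")   # question 7
    return problems
```

For example, `tag_problems("c")` is empty, while `tag_problems("\ufdd0")` reports a noncharacter. A writer that enforced question 8 would refuse any tag for which this list is non-empty.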

These are fastidious, technical questions, but I think they're important to disambiguating Transit's behavior. Question 1 especially affects how people like me use Transit. Thanks!

jlouis commented 9 years ago

It is worth noting that the wire format is usually either JSON or Msgpack. For the latter, Msgpack, strings are always UTF-8 encoded, so any handling of UTF-16 is an implementation specific question. Some languages, Go or Erlang for instance, have direct UTF-8 representations of Unicode built-in, so they don't worry too much about the UTF-16 question.

For JSON, you can pick any encoding (the RFC mentions a SHALL for UTF-8, UTF-16 and UTF-32) with the default being UTF-8 in most cases. The same thing applies: how the implementation maps this into UTF-16 is specific to that implementation.
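As a small illustration of the point about wire encodings, the sketch below serializes one string several ways with Python's standard `json` module and shows that each form decodes back to the same value. The helper name `wire_forms` is made up for this example, and plain JSON strings stand in for Transit's JSON encoding here.

```python
import json


def wire_forms(value: str) -> dict:
    """Return several wire-level forms of the same JSON string value."""
    escaped = json.dumps(value)                  # non-ASCII written as \uXXXX escapes
    raw = json.dumps(value, ensure_ascii=False)  # non-ASCII written literally
    return {
        "escaped": escaped,
        "utf8": raw.encode("utf-8"),
        "utf16": raw.encode("utf-16-le"),
    }


forms = wire_forms("~c\U0001F4DE")

# Every form decodes to the identical string, whatever bytes were on the wire.
decoded = [
    json.loads(forms["escaped"]),
    json.loads(forms["utf8"].decode("utf-8")),
    json.loads(forms["utf16"].decode("utf-16-le")),
]
```

The escaped form spells the telephone character as the surrogate pair `\ud83d\udcde`, yet the decoded string is the same in all three cases. What a language then does with that decoded string is the implementation-specific part.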

As for tags, I imagined tags to be from the ASCII character set only, because this is what most implementations will assume. If you opt to extend the format with an emoji as a tag, I don't think it will cause trouble. Any parser has to read either JSON or Msgpack which means that they should already have internalized their data according to that parser. If you present an emoji, they should in principle be able to read that. However, I think that you will find that some implementations will flounder on this and reject your emoji-tag.

In short, I expect tags to pass the regex [a-zA-Z]+, and I think most implementations will.

Finally, as for OCaml, I skipped this entirely in jlouis/transit-ocaml, since I don't think Core handles Unicode yet.

cigitia commented 9 years ago

@jlouis Thanks for responding. For questions 3–8, restricting tag characters to [a-zA-Z] would be fine, though it does limit the number of scalar tags (which must be single characters) to fifty-two. Perhaps the specification may consider that practically sufficient, which is probably true, as long as it's clear.

In addition, even if only [a-zA-Z] are allowed, question 8 would still be unresolved: should read and write handlers ensure that no other characters are used in their tags?
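For reference, the proposed restriction is easy to state in code. This is a sketch of the commenter's expectation only, not a rule taken from the specification; `is_ascii_letter_tag` and `is_scalar_tag` are hypothetical names.

```python
import re
import string

# The regex suggested in the thread: one or more ASCII letters.
ASCII_LETTER_TAG = re.compile(r"[a-zA-Z]+")


def is_ascii_letter_tag(tag: str) -> bool:
    """True if the whole tag consists of ASCII letters only."""
    return ASCII_LETTER_TAG.fullmatch(tag) is not None


def is_scalar_tag(tag: str) -> bool:
    """Scalar tags are single characters, so under this rule: one ASCII letter."""
    return is_ascii_letter_tag(tag) and len(tag) == 1


# Every tag that could be a scalar tag under this restriction.
possible_scalar_tags = [ch for ch in string.ascii_letters if is_scalar_tag(ch)]
```

`possible_scalar_tags` has fifty-two entries, matching the count above. A handler that enforced question 8 would call something like `is_ascii_letter_tag` on both the read and the write path.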


“For the latter, Msgpack, strings are always UTF-8 encoded, so any handling of UTF-16 is an implementation specific question … For JSON … The same thing applies: how the implementation maps this into UTF-16 is specific to that implementation.”

The thing about question 1 is that it's not affected by which particular encoding form (UTF-8 or UTF-16) is used to encode string values. It's about which string values are allowed for c scalars at all, at the read-handler and write-handler levels, regardless of their particular encoding.

For instance, if a t-tagged scalar’s string representation does not conform to RFC 3339, a Transit read handler that maps t-tagged scalars to java.util.Date would presumably throw an error, regardless of the encoding of the string representation.

Similarly, the read handler of the c tag (and the write handlers of languages' character data types) expect only certain string representations.

The scalar "~c📞" is theoretically possible to read into JavaScript, Ruby, and Python (which just use strings for characters), as well as languages like Go, whose rune type covers the entire Unicode code space.

But the scalar cannot be read into Java, C#, or Dart, because c-tagged scalars are mapped to things like java.lang.Character, which support BMP / 16-bit code units only. transit-clj's and transit-java's c-tag read handler probably would throw an error when reading "~c📞", because to Java, C#, etc., 📞 is “two” “characters”, for better or for worse.
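The difference between the two groups of languages can be sketched as follows. `read_char` is a hypothetical read handler, and its `sixteen_bit_char` flag only emulates a host language whose char type holds one UTF-16 code unit; it is not how transit-java or transit-clj are actually implemented.

```python
def read_char(tagged: str, sixteen_bit_char: bool) -> str:
    """Hypothetical c-tag read handler.

    sixteen_bit_char=True emulates a char type that holds one UTF-16 code unit;
    sixteen_bit_char=False emulates a type that holds any single code point.
    """
    if not tagged.startswith("~c"):
        raise ValueError("not a c-tagged scalar")
    rep = tagged[2:]
    code_points = len(rep)
    # A code point above U+FFFF takes two UTF-16 code units (a surrogate pair).
    code_units = sum(2 if ord(ch) > 0xFFFF else 1 for ch in rep)
    if sixteen_bit_char and code_units != 1:
        raise ValueError(f"{code_units} UTF-16 code units do not fit one char")
    if not sixteen_bit_char and code_points != 1:
        raise ValueError(f"{code_points} code points do not fit one char")
    return rep
```

With `sixteen_bit_char=False`, `read_char("~c📞", ...)` returns the single code point U+1F4DE; with `sixteen_bit_char=True`, the same input is rejected because it occupies two code units.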

Question 1 asks whether the inability of Java, C#, etc. to represent SMP characters with their built-in char types should prohibit SMP characters from being c-tagged scalar values at all (for interchange between the languages that do support them with their respective char types, such as Go). This remains a real ambiguity in the data format, as far as I can tell.

If the answer is, “Yes, SMP characters are prohibited from being values of c-tagged scalars,” then it also asks whether it would be worth creating a new core Transit scalar type that does support the entire Unicode code space, including both the BMP and the SMPs.