Ambiguity on what a character is (regardless of encoding, in two usages)

cigitia commented 9 years ago

The specification refers to “characters” in two separate places:

Usage 1: The c scalar type, which is an extension type of the s type.
Usage 2: “Scalar values have single-character tags and composite values have multi-character tags.”

But there are many definitions of “character”, so both of these usages are ambiguous.

Usage 1

Usage 1 has the following ambiguities/issues:

Many programming languages (Java, C#, Dart, etc.) are strongly coupled to 16-bit code units, a design that dates from before the Unicode code space was extended beyond 16 bits. Those languages’ native char types allow only Basic Multilingual Plane (BMP) code points (since those are what can be represented by single 16-bit code units), and their Transit libraries’ current implementations interpret Transit c scalars as those char types when reading Transit data.
Allowing only single BMP code points to be “characters” also unfortunately excludes many now-commonly used characters in the Supplementary Planes (SMPs) from being interchanged as c values. SMP characters include, in particular, many symbols and emoji in use, such as 𝄫, 😀, 🐴, and 📞, and are supported by languages such as Go (with its native rune type), as well as quasi-supported by any language that uses strings for characters (JavaScript, Python, Ruby, etc.). Support for SMP characters may be especially important in internationalization projects (indeed, for many of my own projects).
Also of note is that some languages do not support even BMP code points, let alone the Unicode code space in general. OCaml’s Character and String types, for instance, use eight-bit bytes, which essentially covers only ASCII. Ruby’s String class also splits into eight-bit bytes, but since it has no concept of a “Character” class, this is mostly moot anyway.
Lastly, the EDN specification also leaves this ambiguous, but Clojure itself, being hosted on the JVM (whose char type is BMP-only / strongly coupled to 16 bits), does not allow Supplementary Plane characters.
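
To make the BMP/supplementary split concrete, here is a minimal sketch in Python, whose strings are indexed by code point. The helper names `utf16_code_units` and `is_bmp` are made up for illustration and are not part of any Transit library.

```python
def utf16_code_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode `ch` in UTF-16."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte-order mark.
    return len(ch.encode("utf-16-le")) // 2


def is_bmp(ch: str) -> bool:
    """True if `ch` is a single code point in the Basic Multilingual Plane."""
    return len(ch) == 1 and ord(ch) <= 0xFFFF


# U+0078 (x), U+1D12B (musical double flat), U+1F600, U+1F4DE (telephone receiver)
for ch in ["x", "\U0001D12B", "\U0001F600", "\U0001F4DE"]:
    print(f"U+{ord(ch):04X} bmp={is_bmp(ch)} utf16_units={utf16_code_units(ch)}")
```

A code point that needs two UTF-16 code units is exactly what a 16-bit char type cannot hold in a single value.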

So, question 1 is: What values is the c scalar Transit type allowed to contain (by its read and write handlers)? I see at least three options:

Option A: A Transit c value is any single Unicode code point (from U+0 to U+10FFFF). This allows characters in Supplementary Planes such as emoji to be easily interchanged between programming languages whose “character” types support them (Python and, to a degree, Ruby and JavaScript). However, it would also require Transit readers in languages with 16-bit, BMP-only char types to throw errors or use other data types if they encounter any SMP characters. (Note that this is already a problem in general: for instance, if a Transit UUID is invalid, a runtime error may still occur in some readers.)
Option B1: A Transit c value is any single 16-bit, BMP-only code point (from U+0000 to U+FFFF). No Transit program can interchange SMP characters such as emoji using the core c type; however, 16-bit-char programming languages such as Java and C# are guaranteed to accept any c value. (Of note is that, in this case, people who also need SMP characters can define an extension type, but unfortunately such a type would not be universal.)
Option B2: The same as Option B1, except that another Transit core scalar type is added, “C” or “y” or something, which extends s and which represents a single, potentially Supplementary Unicode code point between U+0 and U+10FFFF. Languages that map the already-existing c type to their 16-bit-char types would map this new type to their string types or something.
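
The three options can be sketched as validity checks that a write handler might apply. This is only an illustration under assumptions: the function names are invented, the second tag `"C"` is just one of the placeholder names floated above, and whether surrogate code points (U+D800 to U+DFFF) should count under Option A is left open here.

```python
MAX_BMP = 0xFFFF


def valid_c_option_a(value: str) -> bool:
    """Option A: any single code point (Python's len() counts code points)."""
    return len(value) == 1


def valid_c_option_b1(value: str) -> bool:
    """Option B1: a single code point that also lies within the BMP."""
    return len(value) == 1 and ord(value) <= MAX_BMP


def tag_option_b2(value: str) -> str:
    """Option B2: keep "c" for BMP values, use a hypothetical "C" otherwise."""
    if len(value) != 1:
        raise ValueError("expected exactly one code point")
    return "c" if ord(value) <= MAX_BMP else "C"


print(valid_c_option_a("\U0001F4DE"), valid_c_option_b1("\U0001F4DE"))
print(tag_option_b2("x"), tag_option_b2("\U0001F4DE"))
```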

I personally anticipate option B1 to be chosen, since it's what Clojure itself does and takes the least work, ~~but I'm still throwing option A and B2 in the hopes that they too would be considered~~ and I now prefer that chars be clearly equated to UTF-16 code units, after reading the Unicode FAQ's discussion of preferring UTF-16 code units for low-level indexing and strings for everything else. Any option would create more work for someone; the question is which one is most worth it, and the specification should probably clarify this matter in any case.

Question 2: If the answer for question 1 is “16-bit, BMP code points only / no SMP characters allowed in Transit c values”, then should Transit writers (in those languages that support SMP characters) ensure that no Supplementary code points are ever written into Transit data as Transit c values?

Usage 2

For usage 2, there are multiple questions to be clarified:

Question 3: Are only 16-bit/BMP code points or any Unicode code point allowed to be used as scalar-type tags?

Question 4: If a single SMP character is used as a type tag, is it a scalar tag (because it is a single Unicode code point) or is it a composite tag (because it is two 16-bit surrogate units)? (This is essentially equivalent to question 2.)
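
A short sketch of why question 4 matters: the two plausible ways of counting “characters” in a tag disagree for exactly one class of input. Both function names are hypothetical, and non-empty tags are assumed.

```python
def tag_kind_by_code_points(tag: str) -> str:
    """Classify a non-empty tag by its number of Unicode code points."""
    return "scalar" if len(tag) == 1 else "composite"


def tag_kind_by_utf16_units(tag: str) -> str:
    """Classify a non-empty tag by its number of UTF-16 code units."""
    units = len(tag.encode("utf-16-le")) // 2
    return "scalar" if units == 1 else "composite"


for tag in ["c", "cmap", "\U0001F4DE"]:
    print(repr(tag), tag_kind_by_code_points(tag), tag_kind_by_utf16_units(tag))
```

The two functions agree on `"c"` and `"cmap"` but give different answers for a lone supplementary code point, so a reader written in a code-point-based language and one written in a 16-bit-char language could dispatch the same tag differently.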

Question 5: Are whitespace characters allowed in type tags?

Question 6: Are control characters allowed in type tags?

Question 7: Are noncharacter code points (such as U+FDD0 and U+FFFE) allowed in type tags?

Question 8: If the answer to question 3 is “only 16-bit/BMP code points for scalar tags”, or the answer to any of questions 5–7 is “no”, then should writers ensure that the prohibited code points are never used?
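
If the answer to question 8 turned out to be “yes”, a writer-side check could look roughly like the sketch below. It assumes, purely for illustration, that questions 3 and 5–7 are all answered restrictively; `tag_problems` and `is_noncharacter` are invented names, and the specification currently mandates none of this.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters (U+FDD0..U+FDEF, U+xxFFFE, U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def tag_problems(tag: str) -> list:
    """Return a sorted list of labels for questionable code points found in `tag`."""
    problems = set()
    for ch in tag:
        cp = ord(ch)
        if ch.isspace():
            problems.add("whitespace")       # question 5
        if unicodedata.category(ch) == "Cc":
            problems.add("control")          # question 6
        if is_noncharacter(cp):
            problems.add("noncharacter")     # question 7
        if cp > 0xFFFF:
            problems.add("supplementary")    # question 3
    return sorted(problems)


print(tag_problems("c"))
print(tag_problems("a b"))
print(tag_problems("\uFDD0"))
```

Note that some code points fall into more than one bucket: a tab, for example, is both whitespace and a control character.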

These are fastidious, technical questions, but I think they're important to disambiguating Transit's behavior. Question 1 especially affects how people like me use Transit. Thanks!

jlouis commented 9 years ago

It is worth noting that the wire format is usually either JSON or Msgpack. For the latter, Msgpack, strings are always UTF-8 encoded, so any handling of UTF-16 is an implementation-specific question. Some languages, Go or Erlang for instance, have direct UTF-8 representations of Unicode built-in, so they don't worry too much about the UTF-16 question.

For JSON, you can pick any encoding (the RFC mentions a SHALL for UTF-8, UTF-16 and UTF-32) with the default being UTF-8 in most cases. The same thing applies: how the implementation maps this into UTF-16 is specific to that implementation.

As for tags, I imagined tags to be from the ASCII character set only, because this is what most implementations will assume. If you opt to extend the format with an emoji as a tag, I don't think it will cause trouble. Any parser has to read either JSON or Msgpack which means that they should already have internalized their data according to that parser. If you present an emoji, they should in principle be able to read that. However, I think that you will find that some implementations will flounder on this and reject your emoji-tag.

In short, I expect tags to pass the regex [a-zA-Z]+, and I think most implementations will.

Finally, as for OCaml, I skipped totally on this in jlouis/transit-ocaml, since I don't think Core handles Unicode yet.

cigitia commented 9 years ago

@jlouis Thanks for responding. For questions 3–8, restricting tag characters to [a-zA-Z] would be fine, though it does limit the number of scalar tags (which must be single characters) to fifty-two. Perhaps the specification may consider that practically sufficient, which is probably true, as long as it's clear.

In addition, even if only [a-zA-Z] is allowed, question 8 would still be unresolved: should read and write handlers ensure that no other characters are used in their tags?
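
For reference, the suggested restriction is easy to state as a check. This is a sketch of the [a-zA-Z]+ proposal only, not of anything the specification requires, and `is_ascii_letter_tag` is a made-up name.

```python
import re
import string

TAG_RE = re.compile(r"[a-zA-Z]+")


def is_ascii_letter_tag(tag: str) -> bool:
    """True if the whole tag consists of one or more ASCII letters."""
    return TAG_RE.fullmatch(tag) is not None


# Under this rule, the number of possible single-character (scalar) tags
# equals the number of ASCII letters.
print(len(string.ascii_letters))
print(is_ascii_letter_tag("cmap"), is_ascii_letter_tag("\U0001F4DE"))
```

`fullmatch` is used rather than `match` so that a tag with a valid prefix and an invalid tail is still rejected.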

> For the latter, Msgpack, strings are always UTF-8 encoded, so any handling of UTF-16 is an implementation specific question…For JSON…The same thing applies: how the implementation maps this into UTF-16 is specific to that implementation.

The thing about question 1 is that it's not affected by which particular encoding form (UTF-8 or UTF-16) is used to code string values. It's about which string values are allowed for c scalars at all, on the read-handler and write-handler levels, regardless of their particular encoding.

For instance, if a t-tagged scalar’s string representation does not conform to RFC 3339, a Transit read handler that maps t-tagged scalars to java.util.Date would presumably throw an error, regardless of the encoding of the string representation.

Similarly, the read handler of the c tag (and the write handlers of languages' character data types) expect only certain string representations.

The code point U+0078 (“x”) is coded in Transit by the string "~cx", regardless of the particular encoding.
Obviously, "~cxxxxxxxxxxxxxxxxx" should be rejected by every c-tag read handler. Similarly, if you tried to encode the string "xxxxxxxxxxxxxxxxx" as a c-type scalar, the write handler should reject it.
But would "~c📞" be allowed (still regardless of the string's encoding) by the read handler? Would "📞" be allowed by the write handler?
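
The three cases above can be sketched as a read handler with the open question turned into a flag. This is not how any existing Transit library is implemented; `read_c` and its `bmp_only` parameter are hypothetical, and the input is assumed to be an already-decoded string, so the wire encoding plays no part.

```python
C_PREFIX = "~c"


def read_c(encoded: str, bmp_only: bool) -> str:
    """Decode a "~c"-prefixed scalar, rejecting payloads that are not one character.

    `bmp_only` selects between the two readings of "character" discussed above.
    """
    if not encoded.startswith(C_PREFIX):
        raise ValueError("not a c-tagged scalar")
    payload = encoded[len(C_PREFIX):]
    if len(payload) != 1:  # Python counts code points here
        raise ValueError("c payload must be exactly one code point")
    if bmp_only and ord(payload) > 0xFFFF:
        raise ValueError("supplementary code point not allowed as a c value")
    return payload


print(read_c("~cx", bmp_only=True))
print(read_c("~c\U0001F4DE", bmp_only=False) == "\U0001F4DE")
```

With `bmp_only=True` the same `"~c📞"` input raises `ValueError`, which is the divergence between readers that question 1 is asking the specification to settle.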

The scalar "~c📞" is theoretically possible to read into JavaScript, Ruby, and Python (which just use strings for characters), as well as languages like Go, whose rune type covers the entire Unicode code space.

But the scalar cannot be read into Java, C#, or Dart, because c-tagged scalars are mapped to things like java.lang.Character, which support BMP / 16-bit code units only. transit-clj's and transit-java's c-tag read handler probably would throw an error when reading "~c📞", because to Java, C#, etc., 📞 is “two” “characters”, for better or for worse.

Question 1 asks whether the inability of Java, C#, etc. to represent SMP characters with their built-in char types should prohibit SMP characters from being c-tagged scalar values at all (for interchange between the languages that do support them with their respective char types, such as Go). This remains a real ambiguity in the data format, as far as I can tell.

If the answer is, “Yes, SMP characters are prohibited from being values of c-tagged scalars,” then it also asks whether it would be worth creating a new core Transit scalar type that does support the entire Unicode code space, including both the BMP and the SMPs.

cognitect / transit-format

Ambiguity on what a character is (regardless of encoding, in two usages) #31
