amazon-ion / ion-tests

Test vectors for testing compliant Ion implementations.
Apache License 2.0

Invalid UTF-8 character numbers in stringUtf8.ion #65

Open siler opened 4 years ago

siler commented 4 years ago

I ran up against a challenge with a specific test: iontestdata/good/equivs/utf8/stringUtf8.ion.

The final two equivalency sexps in the test include invalid UTF-8 characters (the UTF-16 surrogate characters). According to RFC 3629:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

The Ion text spec has the following to say:

A value stream in the text encoding must be a valid sequence of Unicode code points in any of the three encoding forms (i.e. UTF-8, UTF-16, or UTF-32).

Looking at the four bytes that make up the challenging characters in question, both sequences start with 0b1111_0xxx, which indicates a code point encoded in four UTF-8 bytes. This confirms the UTF-8 encoding of the characters within the invalid range, and indicates that these unescaped strings are invalid UTF-8.

This was identified by Rust's String type (which only supports valid UTF-8).

I can submit a PR to remove those two equivalencies if that seems like the correct approach here. It also seems incorrect to include them as escape sequences in a UTF-8-encoded document, unless there is some indication that the content should be encoded as UTF-16 prior to escape expansion, since the resulting UTF-8 string would also be invalid once the escapes were expanded. If this kind of test is desired for parsing UTF-16 or UTF-32 data, I would suggest making specific files for the different Unicode encodings.

tgregg commented 4 years ago

For the second-to-last sequence, the final string contains the UTF-8 code units \xF0\x90\x80\x80, and in the last sequence the final string contains the code units \xF4\x8F\xBF\xBF. Both of these are valid UTF-8 as far as I understand.

Each sequence contains the equivalent escaped UTF-16 surrogate pair ("\ud800\udc00" and "\udbff\udfff", respectively), but this escape sequence should be interpreted by the Ion text parser, not by the Rust String. In other words, it should be possible to have a Rust String that contains the characters "\ud800\udc00" (that is, a backslash followed by the letter u, and so on).
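
To make that concrete, here's a small Rust sketch (purely illustrative, not taken from the test file or any implementation):

```rust
fn main() {
    // \xF0\x90\x80\x80 is the UTF-8 encoding of the single code point U+10000.
    let a = String::from_utf8(vec![0xF0, 0x90, 0x80, 0x80]).unwrap();
    assert_eq!(a, "\u{10000}");

    // \xF4\x8F\xBF\xBF is the UTF-8 encoding of the single code point U+10FFFF.
    let b = String::from_utf8(vec![0xF4, 0x8F, 0xBF, 0xBF]).unwrap();
    assert_eq!(b, "\u{10FFFF}");

    // The escaped form, by contrast, is just twelve ASCII characters in the
    // source text: a backslash, 'u', 'd', '8', '0', '0', and so on.
    let escaped = r"\ud800\udc00";
    assert_eq!(escaped.len(), 12);
}
```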

Please let me know if I'm misunderstanding something about the issue.

siler commented 4 years ago

Right, I zeroed in on the wrong thing there. I still have a problem here, though.

The second string in each of the UTF-16 surrogate tests contains two 4-digit hexadecimal Unicode code points. Because this is an equivalency test, I must be able to compare this value against the others in the sexp. To do this, I have to un-escape the sequences, which requires choosing an encoding. There is no indication of an encoding for the strings, so I chose the one the rest of the parser is currently using: UTF-8.

I guess the question is: how is the parser supposed to compare string equivalencies without un-escaping? And if escaping is required, why should the parser assume this string should be treated as UTF-16 when un-escaped?

My understanding of the spec was that the encoding is left open; however, it should be consistent for a given byte stream, since we have to choose an encoding up front. Allowing escaped characters makes sense for hard-to-input or unprintable sequences, but not as a method of injecting a character that would otherwise be invalid in the provided context (encoding).

If the intent is to allow a mixed stream of UTF-8, 16, and 32 encodings then there must be a way to identify when it changes mid-stream. The spec indicates that the escapes are code points, not that they are tied to any specific encoding.

Thanks for taking a look!

tgregg commented 4 years ago

Right, the escapes represent code points. Once you are un-escaping the escape sequences, you are just calculating the Unicode (i.e. not UTF-8 or UTF-16 or any other encoding) code point, which is effectively just an integer that maps to a particular Unicode character. If the stream's encoding is UTF-8, then it will have to escape, e.g., UTF-16 surrogate pairs. That surrogate pair maps to a Unicode code point that could have been equivalently represented natively via UTF-8 code units.

So, what you need to be able to do is map an encoding-agnostic sequence of Unicode code points to the String type of your programming language. Hopefully this is straightforward for everything that isn't an escape sequence; for escape sequences, you need to replace the escape sequence with the code point that you calculate by un-escaping before constructing the String.
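
As a rough sketch of that last step in Rust (a hypothetical helper, not code from any Ion implementation, and it sidesteps the surrogate-pairing wrinkle discussed later in this thread):

```rust
// Once an escape like \u2603 or \U0001f600 has been parsed into its integer
// code point, it goes into the target String the same way a literal character does.
fn push_code_point(out: &mut String, code_point: u32) -> Result<(), String> {
    match char::from_u32(code_point) {
        Some(c) => {
            out.push(c);
            Ok(())
        }
        // Lone surrogates (U+D800..U+DFFF) and out-of-range values land here;
        // surrogate pairs need to be combined into one code point before this call.
        None => Err(format!("invalid code point U+{code_point:04X}")),
    }
}

fn main() {
    let mut s = String::new();
    push_code_point(&mut s, 0x2603).unwrap(); // from the escape \u2603
    push_code_point(&mut s, 0x1F600).unwrap(); // from the escape \U0001f600
    assert_eq!(s, "\u{2603}\u{1F600}");
}
```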

Once you've done that, the equivalence testing should be simple: just use the language-standard technique for comparing Strings.

There are examples of this in the string decoding logic of other Ion text parsing implementations; let me know if you have trouble finding them.

siler commented 4 years ago

Alright, we're zeroing in:

If the stream's encoding is UTF-8, then it will have to escape, e.g., UTF-16 surrogate pairs.

To properly encode a UTF-16 surrogate pair into UTF-8, I would not escape the pair individually; I would determine the code point and translate that into UTF-8.

That surrogate pair maps to a Unicode code point that could have been equivalently represented natively via UTF-8 code units.

I think I object to the two 4-digit hexadecimal Unicode code points being considered a surrogate pair. The Ion spec states that an escape is a single code point, and in the UTF-16 spec a code point is encoded as either a single 16-bit value or two adjacent 16-bit values that match a specific format. Either of those two forms represents a single code point.

So, what you need to be able to do is map an encoding-agnostic sequence of Unicode code points to the String type of your programming language.

Ironically, that is what is causing this problem for me. If the distinct code points (i.e. one per escape) are translated directly into their Unicode values, those values individually fall in the high surrogate and low surrogate ranges (respectively), which are only valid when adjacent as 16-bit values in UTF-16, where they would be interpreted as a single code point in the supplementary planes.
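
Concretely, here's what Rust says about those lone values (a quick illustration, not code from my parser):

```rust
fn main() {
    // A lone surrogate is not a Unicode scalar value, so it cannot become a char.
    assert!(char::from_u32(0xD800).is_none()); // high surrogate, rejected
    assert!(char::from_u32(0xDC00).is_none()); // low surrogate, rejected

    // The code point the adjacent pair would denote in UTF-16 is fine.
    assert!(char::from_u32(0x10000).is_some());
}
```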

I'm also a little confused about why this would be desired behavior, though I'm guessing it's due to the environment Ion grew up in (it's essentially leaking implementation details of UTF-16 into the string spec for Ion).

Looking at the Python implementation, I see that we are not treating the 4-digit hex escapes as code points, but instead as a pair of UTF-16 "code units", as Wikipedia puts it.

If this behavior is necessary, I think the spec should provide a bit more information on how the 4-digit hex escapes are intended to be processed.

siler commented 4 years ago

It turns out the Strings and Clobs section actually addresses this and has the same concerns I do:

Ion does not specify the behavior of specifying invalid Unicode code points or surrogate code points (used only for UTF-16) using the escape sequences. It is highly recommended that Ion implementations reject such escape sequences as they are not proper Unicode as specified by the standard. To this point, consider the Ion string sequence, "\uD800\uDC00". A compliant parser may throw an exception because surrogate characters are specified outside of the context of UTF-16, accept the string as a technically invalid sequence of two Unicode code points (i.e. U+D800 and U+DC00), or interpret it as the single Unicode code point U+00010000. In this regard, the Ion string data type does not conform to the Unicode specification. A strict Unicode implementation of the Ion text should not accept such sequences.

siler commented 4 years ago

Given that context, could we either:

I don't like the idea of leaving them, as it encourages implementation of incorrect behavior.

I'd be happy to submit a PR for either.

toddjonker commented 4 years ago

Let's break this down a bit.

There's no such thing as a "4-digit hexadecimal Unicode code point". A code point is an integer; hexadecimal is an encoding; those are two independent things. A code point can be encoded in many different ways.

Every equivalence set in that file consists of a single Unicode code point, encoded several different ways. The comments in the file are misleading: there's no UTF-16 or UTF-32 anywhere in there. The text file is UTF-8, full stop.

The encoding differences test the various flavors of escape sequences that the Ion text spec defines for use in string and symbol elements. The tests represent the intent that all of the escape flavors are interchangeable: they all represent exactly the same content in the data model. The differences are purely syntactic, not semantic, and Ion implementations are required to treat them accordingly.

Suppose I have a UTF-8 encoded text file made up of the 12 (ASCII) characters "\U0010ffff". That file is 12 bytes long, and 12 UTF-8 code units, and 12 Unicode code points. When parsed as Ion data, that file denotes a single Ion string containing the single code point U+10FFFF. Ion's \U escape is a direct representation of code points.
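
To restate those counts in Rust terms (illustrative only):

```rust
fn main() {
    // The raw file contents: quote, backslash, 'U', eight hex digits, quote.
    let source = "\"\\U0010ffff\"";
    assert_eq!(source.len(), 12); // 12 bytes
    assert_eq!(source.chars().count(), 12); // 12 code points of Ion text...

    // ...denoting an Ion string value with exactly one code point,
    // which happens to occupy four bytes when encoded as UTF-8.
    let value = "\u{10FFFF}";
    assert_eq!(value.chars().count(), 1);
    assert_eq!(value.len(), 4);
}
```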

For code points that can be encoded in four hexadecimal digits, falling into what Unicode calls the "Basic Multilingual Plane", Ion offers a shorter escape sequence using \u. But that doesn't work for the above string, because its sole code point is too large. In addition to the \U representation, Ion allows non-BMP code points to be represented in Ion text as an escaped surrogate pair, here "\udbff\udfff". Neither half of that string is valid on its own, because neither half individually denotes a valid (non-surrogate) code point. In other words, the inputs "\udbff" and "\udfff" are both invalid Ion due to having unmatched surrogates.

The spec could use some clarification on this subject. The intent is that when \u denotes a surrogate, it MUST have an adjacent \u surrogate with which it forms a valid surrogate pair. So it's not correct to say that \u always represents a (BMP) code point; there's an exception that enables properly-paired juxtaposed surrogates to be composed into a single code point.
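
Here's a sketch of that pairing rule (my own illustration, not an excerpt from the spec or from any implementation):

```rust
// A high surrogate from one \u escape must be followed by a low surrogate from
// the next \u escape; the pair combines into a single supplementary-plane code point.
fn combine_surrogates(high: u32, low: u32) -> Option<u32> {
    if (0xD800..=0xDBFF).contains(&high) && (0xDC00..=0xDFFF).contains(&low) {
        Some(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    } else {
        None // unmatched or out-of-order surrogates: invalid Ion text
    }
}

fn main() {
    assert_eq!(combine_surrogates(0xD800, 0xDC00), Some(0x10000));
    assert_eq!(combine_surrogates(0xDBFF, 0xDFFF), Some(0x10FFFF));
    assert_eq!(combine_surrogates(0xDC00, 0xD800), None); // wrong order
}
```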

As you've observed, this is confusing, and Ion only allows it for compatibility with JSON, which doesn't have \U to represent larger code points.

The passage you quote from Strings and Clobs (thank you!) is outdated; the surrogate pairing behavior was poorly understood and specified until relatively recently in Ion's lifetime. It's now incorrect: "\uD800\uDC00" MUST be parsed as a single code point U+010000. I think that whole paragraph needs a careful rewrite, in addition to the general clarification on surrogate pairs.

Does this help?

siler commented 4 years ago

Yes, that helps a lot, thank you! I didn't realize that was a limitation of JSON, and without that explanation it would have felt bad to implement that behavior without sufficient justification (lifting the semantics of UTF-16 into an escaped version of itself - why on earth would we do that?).

I don't know if you looked at the table above that paragraph, but it also needs updating. It contains a pretty clear breakdown of the semantics I was describing above (including the phrase "4-digit hexadecimal Unicode code point").

I think it would be important to capture the following points when documenting this: