crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.22k stars 1.61k forks source link

Error handling for grapheme boundaries #11627

Open straight-shoota opened 2 years ago

straight-shoota commented 2 years ago

Given a string consisting of an invalid byte, followed by a character that forms a Grapheme cluster with the Unicode replacement character, such as "\xFF\u200D": Does it comprise a single grapheme (\uFFFD\u200D) or two graphemes (\uFFFD and \u200D)? (Current behaviour is the former.)

When a string contains an invalid byte sequence that is not a proper UTF-8 encoded code point, Char::Reader enters an error state and yields the Unicode replacement character (U+FFFD).

The sequence U+FFFD U+200D is recognized as a grapheme cluster. An invalid byte in a string (such as 0xFF) produces the replacement character and, if followed by U+200D, they are recognized as a grapheme cluster U+FFFD U+200D.

I'm not sure this is correct. Here, U+FFFD is not an actual character, it's just a stand-in for a non-Unicode byte value. It should not be part of any Grapheme cluster.

On the other hand, the replacement character is the de-facto realization of any invalid byte value. This is evidenced by joining the graphemes to a string and splitting them again into graphemes. At this point there's no way to distinguish a literal U+FFFD from an error replacement. So "\xFF\u200D".graphemes.join("").graphemes should definitely result in a single grapheme.

I have a patch to change this at https://github.com/straight-shoota/crystal/tree/fix/grapheme-invalid-char, but I'm really not sure whether it makes sense to try to fix this. This is really just edge cases, so it doesn't have any high impact whichever we choose. However, we should certainly add specs either way.

naqvis commented 2 years ago

Does it comprise a single grapheme (\uFFFD\u200D) or two graphemes (\uFFFD and \u200D)?

Current behavior is correct. Swift and ICU treat them in similar fashion

let word = "\u{FF}\u{200D}"
print("word \(word) has \(word.count) chars.") // word ÿ‍ has 1 chars.
for codeUnit in word.utf8 {
    print("\(codeUnit) ", terminator: "")
}
// 195 191 226 128 141

ICU

$ printf '\uFF\u200D' | uconv -x any-name
\N{LATIN SMALL LETTER Y WITH DIAERESIS}\N{ZERO WIDTH JOINER}

$ printf '\uFFFD\u200D' | uconv -x any-name
\N{REPLACEMENT CHARACTER}\N{ZERO WIDTH JOINER}

When a string contains an invalid byte sequence that is not a proper UTF-8 encoded code point, Char::Reader enters an error state and yields the Unicode replacement character (U+FFFD).

I will question current Char::Reader on treating this as invalid and tripping with REPLACEMENT CHAR

HertzDevil commented 2 years ago

If every other public, codepoint-based API of String substitutes invalid byte sequences with U+FFFD, then so should the grapheme API because it is also based on codepoints.

straight-shoota commented 2 years ago

@naqvis It seems you might have misunderstood the situation. This is not about the code point U+00FF, it's about a byte value of 0xFF in a string, which is not a valid UTF-8 encoding.