Open straight-shoota opened 2 years ago
Does it comprise a single grapheme (
\uFFFD\u200D
) or two graphemes (\uFFFD
and\u200D
)?
Current behavior is correct. Swift and ICU treat them in similar fashion
let word = "\u{FF}\u{200D}"
print("word \(word) has \(word.count) chars.") // word ÿ has 1 chars.
for codeUnit in word.utf8 {
print("\(codeUnit) ", terminator: "")
}
// 195 191 226 128 141
ICU
$ printf '\uFF\u200D' | uconv -x any-name
\N{LATIN SMALL LETTER Y WITH DIAERESIS}\N{ZERO WIDTH JOINER}
$ printf '\uFFFD\u200D' | uconv -x any-name
\N{REPLACEMENT CHARACTER}\N{ZERO WIDTH JOINER}
When a string contains an invalid byte sequence that is not a proper UTF-8 encoded code point,
Char::Reader
enters an error state and yields the Unicode replacement character (U+FFFD
).
I will question current Char::Reader
on treating this as invalid and tripping with REPLACEMENT CHAR
If every other public, codepoint-based API of String
substitutes invalid byte sequences with U+FFFD, then so should the grapheme API because it is also based on codepoints.
@naqvis It seems you might have misunderstood the situation. This is not about the code point U+00FF
, it's about a byte value of 0xFF
in a string, which is not a valid UTF-8 encoding.
Given a string consisting of an invalid byte, followed by a character that forms a Grapheme cluster with the Unicode replacement character, such as
"\xFF\u200D"
: Does it comprise a single grapheme (\uFFFD\u200D
) or two graphemes (\uFFFD
and\u200D
)? (Current behaviour is the former.)When a string contains an invalid byte sequence that is not a proper UTF-8 encoded code point,
Char::Reader
enters an error state and yields the Unicode replacement character (U+FFFD
).The sequence
U+FFFD U+200D
is recognized as a grapheme cluster. An invalid byte in a string (such as0xFF
) produces the replacement character and, if followed byU+200D
, they are recognized as a grapheme clusterU+FFFD U+200D
.I'm not sure this is correct. Here,
U+FFFD
is not an actual character, it's just a stand-in for a non-Unicode byte value. It should not be part of any Grapheme cluster.On the other hand, the replacement character is the de-facto realization of any invalid byte value. This is evidenced by joining the graphemes to a string and splitting them again into graphemes. At this point there's no way to distinguish a literal
U+FFFD
from an error replacement. So"\xFF\u200D".graphemes.join("").graphemes
should definitely result in a single grapheme.I have a patch to change this at https://github.com/straight-shoota/crystal/tree/fix/grapheme-invalid-char, but I'm really not sure whether it makes sense to try to fix this. This is really just edge cases, so it doesn't have any high impact whichever we choose. However, we should certainly add specs either way.