Closed Simon-Martens closed 1 year ago
Good find. Thanks!
Can you add a comment explaining why we're skipping those ranges? Otherwise it would be confusing.
And I guess we cannot really test for it, since it is handled by str::from_utf8, right?
Yup. Afaik any string containing a lone surrogate code point is not well-formed Unicode at all. Surrogates (U+D800 to U+DFFF) only have a meaning in UTF-16, and there they must come as a high/low pair. A str is always valid UTF-8, and valid UTF-8 excludes that range entirely, so a str can never contain one. In other words, trying to build a str with str::from_utf8 from bytes that encode a surrogate gives a Utf8Error.
The compiler knows about this too: it does not give you a "non-exhaustive patterns" error if you leave these ranges out of a match on a char.
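For illustration, a minimal sketch of both points using only the standard library. The helper names `is_valid_utf8` and `side_of_surrogate_gap` are made up for this example and are not part of the PR.

```rust
/// Hypothetical helper: true if `bytes` is well-formed UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Hypothetical helper: this match has no arm for U+D800..=U+DFFF and no
/// wildcard arm, yet it compiles, because a `char` can never hold a surrogate.
fn side_of_surrogate_gap(c: char) -> &'static str {
    match c {
        '\u{0}'..='\u{D7FF}' => "below the surrogate range",
        '\u{E000}'..='\u{10FFFF}' => "above the surrogate range",
    }
}
```

The byte sequence `[0xED, 0xA0, 0x80]` is what U+D800 would look like if it were encoded like any other code point; `str::from_utf8` rejects it, and `char::from_u32(0xD800)` returns `None`.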
Anyways, I've added the comment you suggested inside the trait for the method. Cheers!
Thanks!
There is no need to check for Unicode surrogate code points, since a Rust char can never contain one...
The only characters still disallowed are everything below 0x20 other than \t, \r, and \n, plus the non-characters 0xFFFE and 0xFFFF. This gives a speed bump (a consistent 10% on a 13 MB XML file on my machine), because the character-by-character check gets much simpler.
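Roughly, the simplified check could look like the sketch below. This is not the code from the PR; `is_forbidden_xml_char` and `first_forbidden` are hypothetical names, and the rules are only the ones listed above.

```rust
/// Hypothetical helper: true if `c` may not appear in the text.
/// Surrogates need no arm here because a `char` cannot hold one.
fn is_forbidden_xml_char(c: char) -> bool {
    match c {
        // Tab, line feed, and carriage return are the allowed control characters.
        '\t' | '\n' | '\r' => false,
        // Everything else below 0x20 is rejected.
        '\u{0}'..='\u{1F}' => true,
        // The two non-characters at the end of the Basic Multilingual Plane.
        '\u{FFFE}' | '\u{FFFF}' => true,
        _ => false,
    }
}

/// Hypothetical helper: byte offset and value of the first forbidden
/// character in `s`, if there is one.
fn first_forbidden(s: &str) -> Option<(usize, char)> {
    s.char_indices().find(|&(_, c)| is_forbidden_xml_char(c))
}
```

For example, `first_forbidden("ok\u{1}x")` finds U+0001 at byte offset 2, while a string that only contains ordinary text and newlines yields `None`.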