RazrFalcon / xmlparser

A low-level, pull-based, zero-allocation XML 1.0 parser.
Apache License 2.0

Character Conformance Changes #25

Closed: Simon-Martens closed this 1 year ago

Simon-Martens commented 1 year ago

There is no need to check for Unicode surrogate code points, since a Rust char can never contain one...

The only code points still disallowed are those below 0x20 other than \t, \r, and \n, plus the noncharacters 0xFFFE and 0xFFFF. This gives a speedup (a consistent 10% on a 13 MB XML file on my machine), because the character-by-character check gets much simpler.
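As a rough illustration of the simplified check described above, here is a minimal sketch. The function name `is_xml_char` is hypothetical and is not taken from the xmlparser source; it only assumes the XML 1.0 `Char` production and the fact that a `char` cannot hold a surrogate.

```rust
/// Hypothetical helper (not the crate's actual API): returns true if `c`
/// is allowed by the XML 1.0 `Char` production.
///
/// Surrogates (U+D800..=U+DFFF) need no check here, because a Rust `char`
/// can never hold one.
fn is_xml_char(c: char) -> bool {
    match c {
        // The three whitespace controls that XML 1.0 allows.
        '\t' | '\n' | '\r' => true,
        // Every other code point below 0x20 is rejected.
        '\u{0}'..='\u{1F}' => false,
        // The two noncharacters excluded by the `Char` production.
        '\u{FFFE}' | '\u{FFFF}' => false,
        // Everything else a `char` can hold is allowed.
        _ => true,
    }
}
```

With this sketch, `is_xml_char('\t')` is true, while `is_xml_char('\u{B}')` and `is_xml_char('\u{FFFE}')` are false.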

RazrFalcon commented 1 year ago

Good find. Thanks!

Can you add a comment explaining why we're skipping those ranges? Otherwise it would be confusing.

And I guess we cannot really test for it since it is handled by str::from_utf8, right?

Simon-Martens commented 1 year ago

Yup. AFAIK any string containing a lone surrogate code point is not a validly encoded Unicode string at all. Surrogates only have meaning in UTF-16, where they must appear in pairs. A str is always valid UTF-8, and UTF-8 excludes the surrogate range entirely. In other words, trying to construct a str from bytes that encode a surrogate gives a Utf8Error.
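A small sketch of that point, using only standard-library calls (`std::str::from_utf8` and `char::from_u32`). The helper names are hypothetical and exist only for this illustration.

```rust
/// Hypothetical helper: does `bytes` decode as a `&str`?
fn decodes_as_str(bytes: &[u8]) -> bool {
    // `from_utf8` returns `Err(Utf8Error)` for anything that is not valid
    // UTF-8, including byte sequences that would encode a surrogate.
    std::str::from_utf8(bytes).is_ok()
}

/// Hypothetical helper: can the code point `cp` be stored in a `char`?
fn is_scalar_value(cp: u32) -> bool {
    // `char::from_u32` returns `None` for surrogates and for values
    // above 0x10FFFF.
    char::from_u32(cp).is_some()
}
```

For example, `decodes_as_str(&[0xED, 0xA0, 0x80])` is false, because those three bytes are what U+D800 would look like if it were encoded like an ordinary code point. Likewise `is_scalar_value(0xD800)` is false, while `is_scalar_value(0xE000)` is true.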

The compiler knows about this too: it does not give you a "match arm missing" error (non-exhaustive patterns) if you leave the surrogate range out of a match against a char.
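The following sketch shows the exhaustiveness behaviour mentioned here. The function name is hypothetical; the point is only that the match has no wildcard arm and no arm covering U+D800..=U+DFFF, yet it is accepted as exhaustive.

```rust
/// Hypothetical helper: true if `c` lies below the surrogate gap.
///
/// There is no `_` arm and no arm for U+D800..=U+DFFF. The match is still
/// exhaustive because those values are not valid `char`s.
fn is_below_surrogate_gap(c: char) -> bool {
    match c {
        '\u{0}'..='\u{D7FF}' => true,
        '\u{E000}'..='\u{10FFFF}' => false,
    }
}
```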

Anyways, I've added the comment you suggested inside the trait for the method. Cheers!

RazrFalcon commented 1 year ago

Thanks!