Open gfoidl opened 3 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
We looked at this back in 2018 when the original blog post was published. The ultimate result was that our in-box decoder is faster for real-world payloads and scenarios. See https://github.com/dotnet/corefxlab/issues/1831 for some discussion on this.
Tagging subscribers to this area: @tarekgh, @krwq See info in area-owners.md if you want to be subscribed.
Ah, good to know. It's basically
UTF8 doesn't consist of randomly distributed characters. If you see a character from a certain character set, the next several characters (excluding whitespace and punctuation) are almost certainly going to be in that same range
(from https://github.com/dotnet/corefxlab/issues/1831#issuecomment-372129177)
As you've mentioned the original blog-post, that's about a state-machine. Ridiculously fast unicode (UTF-8) validation is the "new" blog post (from yesterday) that comes along the paper linked above, which is about a (vectorized) lookup.
I pinged the team offline about this. After running benchmarks, our in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic. The Lemire logic likely exceeds the performance of the in-box implementation when validating CJK text. That's a potential area of improvement for us.
Note: I have to couch this in non-absolutes, as the Lemire logic is pure validation ("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?"). So the comparison is a bit rough.
in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic.
🚀 Nice.
Thanks for the update.
In the paper, he benchmarks on JSON and HTML which should be mostly-Latin. He gives a 10x performance advantage over other methods. What could explain that his result is so wildly different from the result documented here in this issue tracker?
The inbox is not just validation:
("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?")
Lemire, et.al. published an intersting paper: Validating UTF-8 in less than one instruction per byte. It's something we should have a look at.
/cc: @GrabYourPitchforks