Validating UTF-8 in less than one instruction per byte

dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

https://docs.microsoft.com/dotnet/core/

MIT License

14.57k stars 4.55k forks source link

Validating UTF-8 in less than one instruction per byte #43688

Open gfoidl opened 3 years ago

gfoidl commented 3 years ago

Lemire, et.al. published an intersting paper: Validating UTF-8 in less than one instruction per byte. It's something we should have a look at.

/cc: @GrabYourPitchforks

Dotnet-GitSync-Bot commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

GrabYourPitchforks commented 3 years ago

We looked at this back in 2018 when the original blog post was published. The ultimate result was that our in-box decoder is faster for real-world payloads and scenarios. See https://github.com/dotnet/corefxlab/issues/1831 for some discussion on this.

ghost commented 3 years ago

Tagging subscribers to this area: @tarekgh, @krwq See info in area-owners.md if you want to be subscribed.

gfoidl commented 3 years ago

Ah, good to know. It's basically

UTF8 doesn't consist of randomly distributed characters. If you see a character from a certain character set, the next several characters (excluding whitespace and punctuation) are almost certainly going to be in that same range

(from https://github.com/dotnet/corefxlab/issues/1831#issuecomment-372129177)

As you've mentioned the original blog-post, that's about a state-machine. Ridiculously fast unicode (UTF-8) validation is the "new" blog post (from yesterday) that comes along the paper linked above, which is about a (vectorized) lookup.

GrabYourPitchforks commented 3 years ago

I pinged the team offline about this. After running benchmarks, our in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic. The Lemire logic likely exceeds the performance of the in-box implementation when validating CJK text. That's a potential area of improvement for us.

Note: I have to couch this in non-absolutes, as the Lemire logic is pure validation ("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?"). So the comparison is a bit rough.

gfoidl commented 3 years ago

in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic.

🚀 Nice.

Thanks for the update.

GSPP commented 3 years ago

In the paper, he benchmarks on JSON and HTML which should be mostly-Latin. He gives a 10x performance advantage over other methods. What could explain that his result is so wildly different from the result documented here in this issue tracker?

gfoidl commented 3 years ago

The inbox is not just validation:

("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?")