utf8 parsing performance

alacritty / vte

Parser for virtual terminal emulators

https://docs.rs/vte/

Apache License 2.0


utf8 parsing performance #4

Open ConnyOnny opened 7 years ago

ConnyOnny commented 7 years ago

Hi, I was eager to benchmark your table-based utf8 parsing approach against the standard library implementation, so I did: https://github.com/ConnyOnny/utf8perf

Unless my test setup is wrong (see main.rs), it seems that avoiding branches is not everything.

jwilm commented 7 years ago

Thanks for putting this together! I've been wanting to do some benchmark work.

There were a few problems with your test setup. I opened a PR. That said, the results aren't much better, but at least they are correct!

Read 21078000 bytes.
Parser "tbl" needed a median 0.055256400 seconds to parse 11431500 characters.
Parser "std" needed a median 0.029445756 seconds to parse 11431500 characters.

Going to mark this as a bug because we should be able to beat std easily.

carl-erwin commented 7 years ago

Hi, some years ago I implemented a UTF-8 decoder with the same table, and used Björn Höhrmann's article (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) as a reference for benchmarking. In his version the state/mask table is more compact than the 8*256 bytes used by utf8parse, and thus more cache friendly.
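
To make the compact-table idea concrete, here is a minimal sketch of a DFA decoder in that spirit. It is not a copy of the table from the article and not utf8parse's code: the class numbering, state numbering, and all names (`class_of`, `CLASSES`, `TRANSITIONS`, `PAYLOAD_MASK`, `decode`) are assumptions made for illustration. Each byte is first mapped to one of 12 classes (256 bytes), and a 9 x 12 transition table (108 bytes) drives the state machine, so the tables total 364 bytes.

```rust
const ACCEPT: u8 = 0; // a complete code point has just been decoded
const REJECT: u8 = 1; // the input is not valid UTF-8
const NUM_CLASSES: usize = 12;

/// Map a byte to one of 12 classes (numbering is this sketch's own choice).
const fn class_of(b: u8) -> u8 {
    match b {
        0x00..=0x7F => 0,               // ASCII
        0x80..=0x8F => 1,               // continuation, low range
        0x90..=0x9F => 2,               // continuation, middle range
        0xA0..=0xBF => 3,               // continuation, high range
        0xC2..=0xDF => 4,               // lead byte of a 2-byte sequence
        0xE0 => 5,                      // 3-byte lead, overlong check needed
        0xE1..=0xEC | 0xEE..=0xEF => 6, // ordinary 3-byte lead
        0xED => 7,                      // 3-byte lead, surrogate check needed
        0xF0 => 8,                      // 4-byte lead, overlong check needed
        0xF1..=0xF3 => 9,               // ordinary 4-byte lead
        0xF4 => 10,                     // 4-byte lead, upper bound check needed
        _ => 11,                        // C0, C1, F5..FF: never valid
    }
}

const fn build_classes() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = class_of(i as u8);
        i += 1;
    }
    table
}

static CLASSES: [u8; 256] = build_classes();

// Row = current state, column = byte class, value = next state.
static TRANSITIONS: [u8; 9 * NUM_CLASSES] = [
    0, 1, 1, 1, 2, 4, 3, 5, 6, 7, 8, 1, // 0: ACCEPT, expecting a new sequence
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 1: REJECT, stays rejected
    1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, // 2: one continuation byte left
    1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, // 3: two continuation bytes left
    1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, // 4: after E0, next must be A0..BF
    1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 5: after ED, next must be 80..9F
    1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, // 6: after F0, next must be 90..BF
    1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, // 7: after F1..F3, three left
    1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 8: after F4, next must be 80..8F
];

// Payload bits carried by the first byte of a sequence, indexed by class.
static PAYLOAD_MASK: [u8; NUM_CLASSES] = [
    0x7F, 0x3F, 0x3F, 0x3F, 0x1F, 0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07, 0x00,
];

/// Decode a byte slice, returning None on invalid or truncated input.
fn decode(bytes: &[u8]) -> Option<Vec<char>> {
    let mut out = Vec::new();
    let mut state = ACCEPT;
    let mut cp: u32 = 0;
    for &b in bytes {
        let class = CLASSES[b as usize] as usize;
        cp = if state == ACCEPT {
            u32::from(b & PAYLOAD_MASK[class]) // first byte of a sequence
        } else {
            (cp << 6) | u32::from(b & 0x3F) // continuation adds six bits
        };
        state = TRANSITIONS[state as usize * NUM_CLASSES + class];
        match state {
            ACCEPT => out.push(char::from_u32(cp)?),
            REJECT => return None,
            _ => {}
        }
    }
    if state == ACCEPT { Some(out) } else { None }
}
```

Calling `decode("héllo".as_bytes())` yields the five characters of the string, while `decode(&[0xC0, 0x80])` (an overlong encoding) yields `None`. Whether the smaller tables actually help here would need to be measured; the sketch only shows why the footprint shrinks compared with the 8*256 byte layout.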

jwilm commented 7 years ago

I've done some minimal optimization effort in #8. When I've got a bit more time, I plan to look into Björn Höhrmann's article mentioned by @carl-erwin to see if we can do better.

As for why the std parser does so much better, this seems to be due to optimizations that become available when the parser can look at multiple bytes at once.
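
As an illustration of what "viewing multiple bytes at once" can buy, here is a minimal sketch of a word-at-a-time ASCII scan. It is not the standard library's actual code; the function name `ascii_prefix_len` and the fixed 8-byte step are assumptions for illustration. The idea is that a whole 8-byte word is pure ASCII exactly when none of its bytes has the high bit set, which a single mask test can check.

```rust
/// Length of the leading run of ASCII bytes, testing 8 bytes per step
/// where possible. A parser that is fed one byte at a time cannot use
/// this kind of shortcut.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    // The high bit of every byte in a 64-bit word.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

    let mut i = 0;
    // Fast path: skip 8 bytes at a time while no byte has its high bit set.
    while i + 8 <= bytes.len() {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&bytes[i..i + 8]);
        // Byte order does not matter because the mask is the same per byte.
        let word = u64::from_ne_bytes(buf);
        if (word & HIGH_BITS) != 0 {
            break;
        }
        i += 8;
    }
    // Slow path: finish byte by byte, which also pinpoints the first
    // non-ASCII byte inside the word that stopped the fast path.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```

A validator could call this first and hand only the remainder, starting at the returned index, to a byte-wise state machine. For mostly-ASCII input that means most bytes never reach the table lookup at all.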

luser commented 6 years ago

You might also be interested in encoding_rs, which is currently shipping in Firefox.