JuliaStrings / utf8proc

a clean C library for processing UTF-8 Unicode data
http://juliastrings.github.io/utf8proc/
Other
1.06k stars 140 forks source link

benchmarking/profiling #12

Open stevengj opened 10 years ago

stevengj commented 10 years ago

It would be good to perform some benchmarking of utf8proc against ICU, and in general to perform some profiling to see if there are any easy targets for optimization.

jiahao commented 10 years ago

How about http://www.mediawiki.org/wiki/PHP_5.2_benchmarks#Unicode_normalization_benchmark

stevengj commented 10 years ago

Sounds like a decent benchmark. (Does PHP use ICU?)

stevengj commented 10 years ago

See also these Ruby benchmarks. The fastest code in that benchmark (although it didn't seem to look at utf8proc or ICU) was unf, which seems to be a wrapper around this unf_ext C++ code.

stevengj commented 10 years ago

The eprun library claims to be fast, albeit in pure Ruby.

A hopeful sign is that eprun was designed by a Unicode expert and includes benchmark data extracted from random Wikipedia pages.

stevengj commented 10 years ago

I just pushed a little benchmark program based on the eprun data files. I also pushed a corresponding benchmark of ICU (although it gives a slightly unfair advantage to ICU by preallocating a huge buffer for the output, whereas utf8proc figures out the output size dynamically).

Results on my machine:

$ cat bench.out 
Deutsch_.txt: 0.00390216
Japanese_.txt: 0.00458185
Korean_.txt: 0.00171503
Vietnamese_.txt: 0.0036274

$ cat icu.out 
Deutsch_.txt: 0.00011213
Japanese_.txt: 0.00026891
Korean_.txt: 9.649e-05
Vietnamese_.txt: 0.00151726

So, unless I am messing something up, ICU is significantly faster, albeit with a far more painful API.

Just to make sure that ICU is not "cheating" by doing some kind of caching that helps it for repeated normalization of the same string (the above benchmark loops 100x), I tried normalizing a single long string formed by concatenating the above files a few dozen times, and got 0.679s for utf8proc and 0.104s for ICU.

(Compiling with -O3 instead of -O2 makes only a slight (< 5%) difference.)

Would be nice to also benchmark against GNU libunistring and perhaps unf_ext from above.

stevengj commented 10 years ago

Note that eprun, which basically makes clever use of regular expressions to get decent performance in Perl or Ruby, is (unsurprisingly) significantly slower than utf8proc on my machine (though not terrible). Its times in seconds for NFKC from Perl eprun are:

Deutsch_.txt: 0.0102
Japanese_.txt: 0.0127
Korean_.txt: 0.0062
Vietnamese_.txt: 0.0089

I couldn't get the Ruby eprun working on my machine, but I don't really understand Ruby.

stevengj commented 10 years ago

In order to figure out the correct buffer size, utf8proc_map has to perform the canonical decomposition twice, so we have a factor of 2 penalty from that compared to the (somewhat artificial) way I am calling ICU. But this does not correspond to a factor of 2 overall, because decomposition is only part of the process. If I hack out this doubled decomposition, the benchmark numbers improve slightly to:

Deutsch_.txt: 0.00256355
Japanese_.txt: 0.00316227
Korean_.txt: 0.00118366
Vietnamese_.txt: 0.00255491
stevengj commented 10 years ago

From gprof, it looks like 40% of the time is spent in utf8proc_decompose_char, 32% in utf8proc_iterate, 14% in utf8proc_decompose, and 11% in utf8proc_encode_char.

stevengj commented 10 years ago

GNU libunistring (benchmark added in a39c1a6ea287e10c72c8b5d3013d232b7f85af3c), which has a similar API (operates directly on UTF8 data and does not require the output buffer to be preallocated) looks very comparable to utf8proc:

Deutsch_.txt: 0.00335499
Japanese_.txt: 0.00381766
Korean_.txt: 0.0018932
Vietnamese_.txt: 0.00256382
stevengj commented 9 years ago

We could make utf8proc_iterate faster if we assumed that the string was valid UTF-8, and hence could remove the checks. This is the case in Julia, because UTF-8 strings are validated when they are created.

One possibility would be to have an additional flag to assume valid input, in which case a codepath with fewer checks is called. (May be somewhat annoying to implement without a bunch of cut-and-paste, though some preprocessor hacks could be used: e.g. have a file with the decompose and iterate functions, and #include it twice with different #defines to get two versions of the functions.)

ScottPJones commented 9 years ago

This is the case in Julia, because UTF-8 strings are validated when they are created.

@stevengj I agree that having valid UTF-8 input could be used to make utf8proc_iterate faster, unfortunately, that's not correct currently in Julia (although @JeffBezanson [I think] and I would like to make that true). About libunistring, pretty please avoid anything with GPL (except in packages), it makes Julia useless (or even dangerous) for people like me, trying to use Julia in a commericial project. For really frequent operations, I'll see if I can get this up to ICU speeds...

tkelman commented 8 years ago

The author of utf8rewind pings me once in a while on reddit to let me know he's been adding new features, so this would be another thing to compare against eventually: https://bitbucket.org/knight666/utf8rewind/overview

It's MIT licensed and pretty small like utf8proc is, so we could borrow anything that turned out to be worth using.

xhochy commented 4 years ago

In the Apache Arrow project, we also evaluating using utf8proc and are running some benchmarks, see https://github.com/apache/arrow/pull/7449#issuecomment-645276441 Currently unilib seems to perform better for us.

stevengj commented 4 years ago

One could probably speed up upper/lowercase conversions, e.g. by adding a specialized table just for this, but it's not clear from the issue whether that functionality is actually performance critical or just a test case?

stevengj commented 3 years ago

This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/

However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.

sgllama commented 1 year ago

This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/

However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.

This may be the thesis author: https://www.linkedin.com/in/jpdurham