Open stevengj opened 10 years ago
Sounds like a decent benchmark. (Does PHP use ICU?)
See also these Ruby benchmarks. The fastest code in that benchmark (although it didn't seem to look at utf8proc or ICU) was unf, which seems to be a wrapper around this unf_ext C++ code.
The eprun library claims to be fast, albeit in pure Ruby.
A hopeful sign is that eprun was designed by a Unicode expert and includes benchmark data extracted from random Wikipedia pages.
I just pushed a little benchmark program based on the eprun data files. I also pushed a corresponding benchmark of ICU (although it gives a slightly unfair advantage to ICU by preallocating a huge buffer for the output, whereas utf8proc figures out the output size dynamically).
Results on my machine:
$ cat bench.out
Deutsch_.txt: 0.00390216
Japanese_.txt: 0.00458185
Korean_.txt: 0.00171503
Vietnamese_.txt: 0.0036274
$ cat icu.out
Deutsch_.txt: 0.00011213
Japanese_.txt: 0.00026891
Korean_.txt: 9.649e-05
Vietnamese_.txt: 0.00151726
So, unless I am messing something up, ICU is significantly faster, albeit with a far more painful API.
Just to make sure that ICU is not "cheating" by doing some kind of caching that helps it for repeated normalization of the same string (the above benchmark loops 100x), I tried normalizing a single long string formed by concatenating the above files a few dozen times, and got 0.679s for utf8proc and 0.104s for ICU.
(Compiling with -O3
instead of -O2
makes only a slight (< 5%) difference.)
Would be nice to also benchmark against GNU libunistring and perhaps unf_ext from above.
Note that eprun
, which basically makes clever use of regular expressions to get decent performance in Perl or Ruby, is (unsurprisingly) significantly slower than utf8proc on my machine (though not terrible). Its times in seconds for NFKC from Perl eprun are:
Deutsch_.txt: 0.0102
Japanese_.txt: 0.0127
Korean_.txt: 0.0062
Vietnamese_.txt: 0.0089
I couldn't get the Ruby eprun working on my machine, but I don't really understand Ruby.
In order to figure out the correct buffer size, utf8proc_map has to perform the canonical decomposition twice, so we have a factor of 2 penalty from that compared to the (somewhat artificial) way I am calling ICU. But this does not correspond to a factor of 2 overall, because decomposition is only part of the process. If I hack out this doubled decomposition, the benchmark numbers improve slightly to:
Deutsch_.txt: 0.00256355
Japanese_.txt: 0.00316227
Korean_.txt: 0.00118366
Vietnamese_.txt: 0.00255491
From gprof
, it looks like 40% of the time is spent in utf8proc_decompose_char
, 32% in utf8proc_iterate
, 14% in utf8proc_decompose
, and 11% in utf8proc_encode_char
.
GNU libunistring (benchmark added in a39c1a6ea287e10c72c8b5d3013d232b7f85af3c), which has a similar API (operates directly on UTF8 data and does not require the output buffer to be preallocated) looks very comparable to utf8proc:
Deutsch_.txt: 0.00335499
Japanese_.txt: 0.00381766
Korean_.txt: 0.0018932
Vietnamese_.txt: 0.00256382
We could make utf8proc_iterate
faster if we assumed that the string was valid UTF-8, and hence could remove the checks. This is the case in Julia, because UTF-8 strings are validated when they are created.
One possibility would be to have an additional flag to assume valid input, in which case a codepath with fewer checks is called. (May be somewhat annoying to implement without a bunch of cut-and-paste, though some preprocessor hacks could be used: e.g. have a file with the decompose
and iterate
functions, and #include
it twice with different #define
s to get two versions of the functions.)
This is the case in Julia, because UTF-8 strings are validated when they are created.
@stevengj I agree that having valid UTF-8 input could be used to make utf8proc_iterate
faster, unfortunately, that's not correct currently in Julia (although @JeffBezanson [I think] and I would like to make that true). About libunistring
, pretty please avoid anything with GPL (except in packages), it makes Julia useless (or even dangerous) for people like me, trying to use Julia in a commericial project. For really frequent operations, I'll see if I can get this up to ICU speeds...
The author of utf8rewind pings me once in a while on reddit to let me know he's been adding new features, so this would be another thing to compare against eventually: https://bitbucket.org/knight666/utf8rewind/overview
It's MIT licensed and pretty small like utf8proc is, so we could borrow anything that turned out to be worth using.
In the Apache Arrow project, we also evaluating using utf8proc
and are running some benchmarks, see https://github.com/apache/arrow/pull/7449#issuecomment-645276441 Currently unilib
seems to perform better for us.
One could probably speed up upper/lowercase conversions, e.g. by adding a specialized table just for this, but it's not clear from the issue whether that functionality is actually performance critical or just a test case?
This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/
However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.
This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/
However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.
This may be the thesis author: https://www.linkedin.com/in/jpdurham
It would be good to perform some benchmarking of utf8proc against ICU, and in general to perform some profiling to see if there are any easy targets for optimization.