Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method [LUCENE-3935]

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.62k stars 1.02k forks source link

Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method [LUCENE-3935] #5008

Open asfimport opened 12 years ago

asfimport commented 12 years ago

I've been profiling Kuromoji, and not very surprisingly, method ConnectionCosts.get(int forwardId, int backwardId) that looks up costs in the Viterbi is called many many times and contributes to more processing time than I had expected.

This method is currently backed by a short[][]. This data stored here structure is a two dimensional array with both dimensions being fixed with 1316 elements in each dimension. (The data is matrix.def in MeCab-IPADIC.)

We can rewrite this to use a single one-dimensional array instead, and we will at least save one bounds check, a pointer reference, and we should also get much better cache utilization since this structure is likely to be in very local CPU cache.

I think this will be a nice optimization. Working on it...

Migrated from LUCENE-3935 by Christian Moen (@cmoen), updated Mar 30 2012 Attachments: LUCENE-3935.patch

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree, this should also help on RAM? currently this is 1316 short[1316]'s i think?

I feel like there is a TODO in here somewhere about this: especially since the connectioncosts.dat we write is "1 dimensional", we just write forwardSize and backwardSize first and force it to a 2D array in RAM. But it could just as well make a single array of forwardSize*backwardSize...

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Ah, this brings back a small project that kind of lies in a dormant state for some time – I've written an annotation processor that generated classes for handling arrays of struct-like types (objects with fields only), including flattened multi-dimensional arrays. The code is on a branch here –

https://github.com/carrotsearch/hppc/blob/structs/hppc-examples/src/main/java/com/carrotsearch/hppc/examples/BattleshipsCell.java

But it's been a while, I need to get back to it, it may be useful.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks.

Robert has done a great job making the binary version of matrix.def tiny with fancy encoding of data. Very impressive!

I've attached a patch and and verified that segmentation (surface forms only) match exactly those with the two-dimensional array based on approx. 100,000 Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on ConnectionCosts.get() after the change. The method is called very, very frequently on indexing, and it's total CPU contribution is \~7-8% after the change, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the change, hoping that all the connection costs would be in very local cache, but this number looks correct to me. Would be great to get your feedback if this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

+1, Robert and I already discussed about making one array out of it. It was at the time when I rewrote the hairy targetMap to be much more memory effective and not a huuuuuuge array of arrays with 1 entry each :-)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

+1. If this is called very frequently and you can affort storing ints instead of shorts then an int[] will have better alignment properties (and will not require extending to an int). May or may not play a difference depending on architecture (cpu cache sizes also matter here).

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks, Mike, Uwe and Dawid.

It's a good idea to do testing using int – thanks for that. I did this hastily last night and results suggested that there wasn't a lot to be gained on Mac OS X, but I will look more into this and do a better test.

Kuromoji has a low memory footprint (uses FST instead of double-array trie, does Viterbi in a streaming fashion, etc.), which is a nice characteristic I'd like to keep. Hence, I'm reluctant to trade 3MB of memory unless int really gives us a lot in terms of additional speed. (Kuromoji currently segments \~2.5-3MB/sec per CPU core on my system.)

I'll do some additional testing, have a think, but I'm likely to commit the short version in the attached patch tomorrow.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I did this hastily last night and results suggested that there wasn't a lot to be gained on Mac OS X

I agree it may not be noticeable because there are so many factors kicking in here (smaller structure - better cpu cache utilization vs. larger structure - potentially faster access to each value but potential cache misses).

Makes sense to keep short[] in place, ignore my comment.