atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0
953 stars 131 forks source link

Optimization opportunity in the fst usage. #130

Open fulmicoton opened 5 years ago

fulmicoton commented 5 years ago

Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.

In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst). The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.

The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.

fulmicoton commented 5 years ago

After further investigation...

The largest optimization opportunity is probably isKanjiOnly. (40% is spent in this function and it is possible to bring it down to 0%).

The useless call to substring may also be wasting CPU Time (> 5%). It is hard to measure accurately however as they have hidden impact as well. (it does copy the original string since Java 7.0)

akkikiki commented 4 years ago

@cmoen Might be of your interest. I do not know the current Lucene's implementation on it, but it is doable by e.g., having a separate function in FST.java that returns the indices of all prefixes which exist in the dictionary.