Open fulmicoton opened 5 years ago
After further investigation...
The largest optimization opportunity is probably isKanjiOnly
. (40% is spent in this function and it is possible to bring it down to 0%).
The useless call to substring may also be wasting CPU Time (> 5%). It is hard to measure accurately however as they have hidden impact as well. (it does copy the original string since Java 7.0)
Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.
In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst). The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.
The headroom is valuable, but not massive. Around 15% of the time is spent in
Fst.lookup
. One can hope to cut this bit in half.