atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Tokenizing 一人 (ひとり, hitori) separates it into 一 (いち, ichi) and 人 (ひと, hito) #125

Open andy840119 opened 6 years ago

andy840119 commented 6 years ago

This project is great and useful for me : ) but I have a small question. I'm not sure whether this should be seen as a bug or not, but some words like 一人 (ひとり, hitori) are separated into two words: 一 (いち, ichi) and 人 (ひと, hito).

akkikiki commented 6 years ago

Hi,

I recommend looking at the output of the "Viterbi" option available at https://www.atilika.com/en/kuromoji/ to see what's going on. It seems that for IPADic (the default dictionary) there is a connection weight that strongly favors the transition from 数 (number) to 接尾 (suffix), so the input is analyzed as a number followed by the suffix 人. If you look at the result using UniDic, it outputs "一人" (ひとり) as a single token, so the naive solution is simply to switch to UniDic.
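To make the connection-weight explanation concrete, here is a minimal sketch of how a path cost (word costs plus connection costs, the quantity a Viterbi search minimizes) can decide between the two segmentations. This is not Kuromoji's API and the numbers are not IPADic's real values: the class `ToyLattice`, its methods, and every cost below are invented purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ToyLattice {
    /** One lattice node: surface form, coarse POS class, and word cost. */
    static final class Node {
        final String surface;
        final String pos;
        final int wordCost;

        Node(String surface, String pos, int wordCost) {
            this.surface = surface;
            this.pos = pos;
            this.wordCost = wordCost;
        }
    }

    /** Made-up connection cost: only the 数 -> 接尾 transition is non-zero. */
    static int connectionCost(String left, String right, int numberToSuffixCost) {
        if (left.equals("数") && right.equals("接尾")) {
            return numberToSuffixCost;
        }
        return 0;
    }

    /** Sum of word costs and connection costs along BOS -> path -> EOS. */
    static int pathCost(List<Node> path, int numberToSuffixCost) {
        int total = 0;
        String prev = "BOS";
        for (Node node : path) {
            total += connectionCost(prev, node.pos, numberToSuffixCost) + node.wordCost;
            prev = node.pos;
        }
        return total + connectionCost(prev, "EOS", numberToSuffixCost);
    }

    /** Returns the cheaper segmentation of 一人, with surfaces joined by "|". */
    static String cheaperSegmentation(int numberToSuffixCost) {
        // Hypothetical word costs; lower means more preferred.
        List<Node> split = Arrays.asList(
                new Node("一", "数", 3000),
                new Node("人", "接尾", 3000));
        List<Node> whole = Arrays.asList(new Node("一人", "名詞", 6000));

        List<Node> best =
                pathCost(split, numberToSuffixCost) < pathCost(whole, numberToSuffixCost)
                        ? split
                        : whole;

        StringBuilder joined = new StringBuilder();
        for (Node node : best) {
            if (joined.length() > 0) {
                joined.append('|');
            }
            joined.append(node.surface);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Strongly favored 数 -> 接尾 connection: 5500 < 6000, so the split wins.
        System.out.println(cheaperSegmentation(-500)); // 一|人
        // Neutral-to-costly connection: 6600 > 6000, so the single token wins.
        System.out.println(cheaperSegmentation(600));  // 一人
    }
}
```

With these toy numbers, a sufficiently cheap 数 → 接尾 connection is enough to outweigh the single-token entry, which matches the behaviour the Viterbi view on the demo page appears to show for IPADic.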