atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Unidic design flaw #118

Open wareya opened 7 years ago

wareya commented 7 years ago

UniDic's lexicon data doesn't carry enough information for the Viterbi algorithm to distinguish, in context, words that share a surface form and word type but differ in reading. As a result, お父さん is always analyzed as お・ちち・さん instead of お・とう・さん, as it should be.

父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*

父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*

They're otherwise identical, but the ちち reading has a lower cost, so it always wins when the word is written in kanji. Basically, UniDic's segment features have no way to distinguish the two. It's easy to write a script that looks for segments that are identical in surface form and feature list, to see which problematic matches exist.
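As a rough sketch of such a script: the Python below groups lexicon rows by surface form, left/right context ID, and part-of-speech fields, then keeps the groups that contain more than one reading. The column positions are an assumption inferred from the two rows quoted above, and the name `find_ambiguous_entries` is made up for illustration; a real run would read the UniDic `lex.csv` instead of the inline sample.

```python
import csv
import io
from collections import defaultdict

# Column positions inferred from the two sample rows in this issue (an assumption).
SURFACE, LEFT_ID, RIGHT_ID, COST = 0, 1, 2, 3
POS_COLUMNS = slice(4, 10)  # part-of-speech and conjugation fields
READING = 10                # チチ / トウ in the sample rows


def find_ambiguous_entries(rows):
    """Return groups of rows the lattice cannot tell apart but that differ in reading."""
    groups = defaultdict(list)
    for row in rows:
        if len(row) <= READING:
            continue  # skip malformed or short rows
        key = (row[SURFACE], row[LEFT_ID], row[RIGHT_ID], tuple(row[POS_COLUMNS]))
        groups[key].append((int(row[COST]), row[READING]))
    ambiguous = {}
    for key, entries in groups.items():
        if len({reading for _, reading in entries}) > 1:
            # Sort by cost: the first entry is the one that always wins.
            ambiguous[key] = sorted(entries)
    return ambiguous


# The two rows quoted in this issue.
SAMPLE = """\
父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*
父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*
"""

if __name__ == "__main__":
    rows = csv.reader(io.StringIO(SAMPLE))
    for key, entries in find_ambiguous_entries(rows).items():
        print(key[0], entries)
```

On the two sample rows this reports a single group for 父, with the チチ entry listed ahead of トウ because of its lower cost.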

This is basically impossible to fix on kuromoji's side without adding a list of segments that act differently than their features indicate, which would be ridiculous. On the other hand, one of kuromoji's implicit goals is to not be worse than other morphological analyzers, so this is a problem worth posting about.

To gloss over this problem, I added a bunch of お父-style entries to my user dictionary, with the お・御 prefix folded into the entry itself (these are for unidic-kanaaccent STAGING):

おとう,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,おとう,オトー,おとう,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
お父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,お父,オトー,お父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
御父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,御父,オトー,御父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*

おかあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,おかあ,オカー,おかあ,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
お母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,お母,オカー,お母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
御母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,御母,オカー,御母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*

おにい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,おにい,オニー,おにい,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
お兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,お兄,オニー,お兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
御兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,御兄,オニー,御兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*

おねえ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,おねえ,オネー,おねえ,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
お姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,お姉,オネー,お姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,御姉,オネー,御姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

お姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,お姐,オネー,お姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,御姐,オネー,御姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

おばあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,おばあ,オバー,おばあ,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
お婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,お婆,オバー,お婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
御婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,御婆,オバー,御婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*

おじい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,おじい,オジー,おじい,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
お爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,お爺,オジー,お爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
御爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,御爺,オジー,御爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*

(The weights are for illustration only; I think they're too high for the entries to match in all intended cases.)
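The entries above all follow one pattern, so they can be generated rather than typed by hand. The sketch below rebuilds the three rows for one kinship term; the helper names `make_row` and `make_rows` are made up for illustration, the context ID and cost are simply copied from the entries above, and the rule for the pronunciation field (final kana replaced by ー) is an assumption that only holds for the words listed here.

```python
# Context ID and cost copied from the entries above; the cost is illustrative.
CONTEXT_ID = 5142
COST = 8000


def make_row(surface, lemma, kana, pron):
    """Build one user-dictionary row in the layout used by the entries above."""
    fields = [
        surface, str(CONTEXT_ID), str(CONTEXT_ID), str(COST),
        "名詞", "普通名詞", "一般", "*", "*", "*",
        kana, lemma, surface, pron, surface, pron,
        "和", "*", "*", "*", "*",
        kana, kana, kana, kana, "*", "*", "2", "*", "*",
    ]
    return ",".join(fields)


def make_rows(hiragana, kanji, kana):
    """Return rows for the hiragana spelling and the お / 御 prefixed kanji spellings."""
    lemma = "御" + kanji
    # Assumption: the pronunciation replaces the final kana with ー (オトウ -> オトー).
    # True for the kinship terms in this list, not for Japanese in general.
    pron = kana[:-1] + "ー"
    surfaces = [hiragana, "お" + kanji, "御" + kanji]
    return [make_row(surface, lemma, kana, pron) for surface in surfaces]


if __name__ == "__main__":
    for row in make_rows("おとう", "父", "オトウ"):
        print(row)
```

Calling `make_rows("おばあ", "婆", "オバア")` yields the three お婆 rows in the same way. The お姐 / 御姐 pair has no hiragana row of its own, so it would need `make_row` called directly.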

fasiha commented 6 years ago

In your particular example, if I ask for the two best results, the second one has とう instead of ちち. Here I'm using MeCab with UniDic, but it should be the same with Kuromoji:

➜ echo お父さん | mecab -d /usr/local/lib/mecab/dic/unidic -N2
お   オ   オ   御   接頭辞
父   チチ  チチ  父   名詞-普通名詞-一般
さん  サン  サン  さん  接尾辞-名詞的-一般
EOS
お   オ   オ   御   接頭辞
父   トー  トウ  父   名詞-普通名詞-一般
さん  サン  サン  さん  接尾辞-名詞的-一般
EOS

Isn't this one of the big reasons why these parsers give you N-best results, N>1?
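A toy sketch of why the ranking comes out this way, assuming the usual lattice scoring where a path's cost is the sum of word costs and connection costs. The two word costs are the ones from the lex rows quoted in the first comment; the connection costs passed in are hypothetical placeholders, and `path_cost` and `n_best` are made-up names, not MeCab or Kuromoji APIs. Because both readings share the same context IDs, they receive the same connection costs, so the word cost alone decides the order.

```python
# Word costs taken from the two lex rows quoted earlier in this issue.
WORD_COST = {"チチ": 3860, "トウ": 4656}


def path_cost(reading, left_connection, right_connection):
    """Cost of the 父 segment in context.

    The connection costs are hypothetical inputs. They are the same for both
    readings because both rows use context ID 5142 on each side.
    """
    return left_connection + WORD_COST[reading] + right_connection


def n_best(n, left_connection, right_connection):
    """Return the n cheapest readings, cheapest first."""
    ranked = sorted(
        WORD_COST,
        key=lambda reading: path_cost(reading, left_connection, right_connection),
    )
    return ranked[:n]


if __name__ == "__main__":
    # Hypothetical connection costs; any pair gives the same ordering.
    for left, right in [(0, 0), (-500, 1200), (3000, -700)]:
        print(left, right, n_best(2, left, right))
```

Whatever the context, the 1-best answer is チチ and トウ only appears once N is at least 2, which matches the MeCab output above.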

wareya commented 6 years ago

UniDic 2.3.0 solves this problem for this specific set of words.

cmoen commented 6 years ago

Could you indicate more precisely what you mean by "unidic 2.3.0"? Do you have a URL you can share? Thanks.

wareya commented 6 years ago

http://unidic.ninjal.ac.jp/

2018/03/29: v2.3.0 (beta version) of UniDic for contemporary Japanese released. The alpha version will be released as a full package in early April.

It's only listed on the "back-number" page:

http://unidic.ninjal.ac.jp/back_number