Tokenizing text in Hiragana character set

atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search

Apache License 2.0


Tokenizing text in Hiragana character set #105

Open mhko opened 8 years ago

mhko commented 8 years ago

Tokenizing a sentence "寿司が美味しい。" produces the following tokens: <寿司>,<が>,<美味しい>,<。>

Tokenizing the same sentence written only in hiragana yields the same segmentation, which is great: <すし>,<が>,<おいしい>,<。>

However, for some other words, tokenization behavior depends on the input character set.

For example, for "大学生":

The word is correctly tokenized into <大学生> when the input is written in kanji.

When the input is written in hiragana, "だいがくせい", the same word produces the following tokens: <だい>,<が>,<くせ>,<い>.

Is this a known issue? Is there any configuration I could tweak so that the two cases behave the same way regardless of the input character set?

Thanks in advance for your help!

akkikiki commented 8 years ago

This is a common issue. When the tokenization model is trained, it looks at surface features (along with POS, base form, conjugation form, etc.) rather than readings. The だいがくせい example is a fairly easy case, but in general a sentence written entirely in hiragana, or in any single script, is hard to tokenize because the number of plausible segmentations goes up.

One way to handle this is to add だいがくせい to the user dictionary. For example:

だいがくせい,1285,1285,some integer,名詞,一般,*,*,*,*,だいがくせい,ダイガクセイ,ダイガクセイ
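To make the ambiguity concrete, here is a minimal sketch of a minimum-cost segmenter in Python. It is a toy, not Kuromoji's implementation: the `segment` function, the lexicon, and every cost value are invented for illustration, and it ignores connection costs and POS entirely. It only shows how a hiragana string with no dictionary entry of its own falls apart into shorter known words, and how adding that entry changes the result.

```python
# Toy minimum-cost segmenter. The lexicon and costs below are hypothetical
# and are NOT taken from Kuromoji's dictionary or cost model.

def segment(text, lexicon):
    """Return the lowest-total-cost split of `text` into words found in `lexicon`.

    Returns an empty list if no split into known words exists.
    """
    inf = float("inf")
    # best[i] holds (total cost, tokens) for the cheapest split of text[:i].
    best = [(inf, [])] * (len(text) + 1)
    best[0] = (0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in lexicon and best[start][0] < inf:
                cost = best[start][0] + lexicon[word]
                if cost < best[end][0]:
                    best[end] = (cost, best[start][1] + [word])
    return best[len(text)][1]


# Hypothetical base lexicon: word -> cost (lower is preferred).
base_lexicon = {"大学生": 4, "だい": 3, "が": 1, "くせ": 3, "い": 2}

# The kanji form is a single dictionary word.
print(segment("大学生", base_lexicon))

# The hiragana form has no entry, so it is covered by shorter known words.
print(segment("だいがくせい", base_lexicon))

# Adding a (hypothetical) user entry makes the single-token path the cheapest.
with_user_entry = dict(base_lexicon)
with_user_entry["だいがくせい"] = 4
print(segment("だいがくせい", with_user_entry))
```

With these made-up costs the second call returns the four-way split seen in the report, and the third returns one token. In a real dictionary the outcome depends on the actual word and connection costs, which is why the cost field in the entry above matters.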

cmoen commented 8 years ago

Nate, could you share more information on your use-case and what you would like to accomplish?

mhko commented 8 years ago

Reading extraction is such an awesome feature. :) However, the feature works only if <大学生> and <だいがくせい> are tokenized the same way, doesn't it? For example, I'd like to be able to find a document that contains the word in kanji, <大学生>, using a search query with the same word in hiragana, <だいがくせい>. I am finding cases where such searches do not work even with reading normalization, because the two inputs are tokenized differently.

Thanks for the help!
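The mismatch described above can be sketched in a few lines of Python. This assumes a search index that matches on per-token katakana readings. The token lists are hand-written to mirror the segmentations reported in this thread rather than produced by Kuromoji, and `hiragana_to_katakana` and `reading_terms` are hypothetical helpers.

```python
# Sketch of why reading normalization alone does not make the query match.
# Token lists are hand-written for illustration, not Kuromoji output.

def hiragana_to_katakana(text):
    """Shift hiragana (U+3041..U+3096) to the matching katakana code points."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def reading_terms(tokens):
    """Return the reading of each (surface, reading) token, in order."""
    return [reading for _surface, reading in tokens]


# Indexed document: one kanji token with its dictionary reading.
doc_tokens = [("大学生", "ダイガクセイ")]

# Hiragana query, split the way the issue reports; readings derived per token.
query_surfaces = ["だい", "が", "くせ", "い"]
query_tokens = [(s, hiragana_to_katakana(s)) for s in query_surfaces]

index_terms = set(reading_terms(doc_tokens))
query_terms = reading_terms(query_tokens)

# Term-level matching fails: none of the query readings is an indexed term.
all_terms_match = all(term in index_terms for term in query_terms)
print(all_terms_match)

# The concatenated readings are identical; only the token boundaries differ.
same_reading = "".join(reading_terms(doc_tokens)) == "".join(query_terms)
print(same_reading)
```

The readings agree once joined, so the normalization itself is fine. The search fails because the index holds one term while the query supplies four.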

akkikiki commented 8 years ago

Let's try not to confuse the "feature" used in machine learning models with a "feature" of Kuromoji. The word "feature" has a special meaning in the context of machine learning, so I prefer not to use it in any other way here.

I do not have anything to add other than recommending the user dictionary feature to make the tokenization consistent. Christian may have some additional comments.

mhko commented 8 years ago

Sorry if my question wasn't clear. Let me know if you need me to clarify anything.

@cmoen Does the use case sound reasonable to you?