atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Very odd tokenization of a sentence #82

Closed: kallewoof closed this issue 8 years ago

kallewoof commented 8 years ago

Tokenizing "色々やらなきゃならんことがたくさんあるんだ" with the command-line version of Kuromoji (using ipadic) results in

色々 やら なき ゃならんことがたくさんあるんだ

The online demo at http://www.atilika.org/ produces the same output.

kallewoof commented 8 years ago

I tried the unidic-neologd dictionary, and it parses the sentence fine. I guess ipadic is a bit volatile?

cmoen commented 8 years ago

This isn't really a bug, but rather a limitation of IPADIC in dealing with colloquial language. Try a UniDic variant if you need to analyze colloquial language.
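For anyone who wants to check a sentence against more than one dictionary, here is a rough sketch of a side-by-side comparison. It is not part of Kuromoji: the class `SegmentationCompare` and the helpers `compare` and `longerThan` are hypothetical names made up for illustration, and the "ipadic" entry below only replays the segmentation reported in this issue instead of calling the library. The commented-out wiring assumes the Kuromoji dictionary artifacts are on the classpath and that their `Tokenizer` classes expose `tokenize(String)` returning tokens with `getSurface()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: compares how several tokenizers segment one sentence. */
final class SegmentationCompare {

    /** Runs each named tokenizer on the text and joins its surface forms with " | ". */
    static Map<String, String> compare(
            String text, Map<String, Function<String, List<String>>> tokenizers) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, List<String>>> e : tokenizers.entrySet()) {
            result.put(e.getKey(), String.join(" | ", e.getValue().apply(text)));
        }
        return result;
    }

    /**
     * Returns the tokens with more than maxLength code points. A very long
     * token is only a heuristic hint that the dictionary failed to split a
     * colloquial stretch; it is not proof of a mis-segmentation.
     */
    static List<String> longerThan(List<String> surfaces, int maxLength) {
        List<String> flagged = new ArrayList<>();
        for (String s : surfaces) {
            if (s.codePointCount(0, s.length()) > maxLength) {
                flagged.add(s);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String text = "色々やらなきゃならんことがたくさんあるんだ";
        Map<String, Function<String, List<String>>> tokenizers = new LinkedHashMap<>();

        // Stand-in that replays the IPADIC output reported in this issue.
        List<String> reported =
                Arrays.asList("色々", "やら", "なき", "ゃならんことがたくさんあるんだ");
        tokenizers.put("ipadic (as reported)", t -> reported);

        // Assumed wiring with Kuromoji on the classpath (not compiled here):
        // com.atilika.kuromoji.unidic.Tokenizer unidic =
        //         new com.atilika.kuromoji.unidic.Tokenizer();
        // tokenizers.put("unidic", t -> unidic.tokenize(t).stream()
        //         .map(tok -> tok.getSurface())
        //         .collect(java.util.stream.Collectors.toList()));

        compare(text, tokenizers)
                .forEach((name, seg) -> System.out.println(name + ": " + seg));
        System.out.println("long tokens: " + longerThan(reported, 6));
    }
}
```

The tokenizers are passed in as plain functions so the comparison does not depend on any particular dictionary module. Swapping the stand-in for a real IPADIC or UniDic tokenizer should only require changing the lambda.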