atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0
950 stars 131 forks

Unidic Tokenization on Romaji Words #103

Open tobias-khs opened 8 years ago

tobias-khs commented 8 years ago

Tested with version 0.9.0.

I know Kuromoji is meant for Japanese, but it would be nice if some romaji words were tokenized consistently.

The string "hello golf2" is tokenized into:

which is fine. But when I tokenize "golf2 hello" with com.atilika.kuromoji.unidic.Tokenizer, I get the following (unidic.kanaaccent behaves the same way, but the other tokenizers do not):

It would be nice if the second case were tokenized like the first. In the meantime, I might handle this with a user dictionary.
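For anyone who wants to check this kind of order dependence, here is a minimal sketch of a harness that tokenizes two words in both orders and compares the resulting surface forms. The class name `TokenOrderCheck` and both method names are hypothetical, and `whitespaceTokenize` is only a stand-in so the example runs without Kuromoji on the classpath. It does not reproduce the token output from the report above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class TokenOrderCheck {

    /**
     * Tokenizes "a b" and "b a" and reports whether both inputs yield the
     * same surface forms once token order is ignored.
     */
    public static boolean sameSurfacesInBothOrders(
            Function<String, List<String>> tokenize, String a, String b) {
        List<String> forward = new ArrayList<>(tokenize.apply(a + " " + b));
        List<String> reversed = new ArrayList<>(tokenize.apply(b + " " + a));
        // Sort copies so that only the set of surfaces matters, not their order.
        Collections.sort(forward);
        Collections.sort(reversed);
        return forward.equals(reversed);
    }

    /** Stand-in tokenizer (NOT Kuromoji): splits the text on single spaces. */
    public static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.split(" "));
    }

    public static void main(String[] args) {
        // Assumption: to check Kuromoji itself, replace the stand-in with a
        // function that calls com.atilika.kuromoji.unidic.Tokenizer#tokenize
        // and maps each returned token to its surface form.
        boolean consistent = sameSurfacesInBothOrders(
                TokenOrderCheck::whitespaceTokenize, "hello", "golf2");
        System.out.println("consistent: " + consistent);
    }
}
```

Passing the tokenizer in as a `Function` keeps the check independent of any particular dictionary, so the same harness could be pointed at the unidic, unidic.kanaaccent, and other tokenizers in turn.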