Unidic Tokenization on Romaji Words

Tested with version 0.9.0.

I know Kuromoji is meant for Japanese, but it would be nice if romaji words were tokenized consistently.

The string "hello golf2" is tokenized into:

hello
golf
2

which is fine. But when I tokenize "golf2 hello" using com.atilika.kuromoji.unidic.Tokenizer, I get the following (the same happens with unidic.kanaaccent, but not with the other tokenizers):

g
o
l
f
2
hello

It would be nice if the second case were tokenized like the first. In the meantime, I might handle this with a user dictionary.
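
Another stopgap, separate from the user dictionary idea, would be to post-process the token surfaces and glue runs of single ASCII letters back together. The sketch below is only an illustration of that idea: it is plain Java that works on a list of surface strings (for example, ones collected from Token.getSurface()), and the class and method names (RomajiMerger, mergeLetterRuns) are made up for this example and are not part of Kuromoji.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Kuromoji: merges runs of single-letter
// tokens (g, o, l, f) back into one token (golf).
final class RomajiMerger {
    private RomajiMerger() {}

    // True if the surface is exactly one ASCII letter (a-z or A-Z).
    static boolean isSingleAsciiLetter(String s) {
        if (s.length() != 1) {
            return false;
        }
        char c = s.charAt(0);
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Concatenates each maximal run of adjacent single-letter surfaces;
    // every other surface is passed through unchanged.
    static List<String> mergeLetterRuns(List<String> surfaces) {
        List<String> merged = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String surface : surfaces) {
            if (isSingleAsciiLetter(surface)) {
                run.append(surface);
            } else {
                if (run.length() > 0) {
                    merged.add(run.toString());
                    run.setLength(0);
                }
                merged.add(surface);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() > 0) {
            merged.add(run.toString());
        }
        return merged;
    }
}
```

Applied to the surfaces g, o, l, f, 2, hello, this gives golf, 2, hello, which matches the first case. One caveat: it would also merge single letters that really were separate in the input if nothing (such as a whitespace token) sits between them, so it is a workaround and not a fix for the tokenizer itself.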

atilika / kuromoji #103