atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Compound word with nakaguro in it #104

Closed mhko closed 6 years ago

mhko commented 8 years ago

Thanks for the library.

I was testing compound words containing the nakaguro character (・) and noticed that the compound word 'コカ・コーラ' is tokenized to a single term <コカ・コーラ> in Search mode, whereas another such word, 'アイス・キューブ', is tokenized into its components <アイス>, <キューブ>. Does the former produce a single token because it's a trademark, or could this be a bug? Ultimately, I'd like to find documents that contain <コカ・コーラ> using the search term <コーラ>.

Thanks in advance for your help!

akkikiki commented 8 years ago

Which dictionary are you using? If it is the default one (IPADic), this is not a bug, because the dictionary contains the following entry:

コカ・コーラ,1288,1288,3891,名詞,固有名詞,一般,*,*,*,コカ・コーラ,コカコーラ,コカコーラ

Search mode works mainly on compound words that are not in the dictionary. In fact, アイスキューブ is not in IPADic. If you want a quick fix, you can customize the tokenizer by adding words to the dictionary.

cmoen commented 8 years ago

There's also a builder option for splitting unknown words on nakaguro that can be used as follows:

Tokenizer tokenizer = new Tokenizer.Builder()
    .isSplitOnNakaguro(true)
    .mode(TokenizerBase.Mode.SEARCH)
    .build();

This option only applies to unknown words, but in combination with search mode, perhaps it would make more sense for us to split on nakaguro in all cases.

In your case, as pointed out by Fujinuma-san, "コカ・コーラ" is a known word.
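
For the original goal (finding documents that contain コカ・コーラ with the search term コーラ), one possible workaround that does not depend on the dictionary is to post-process token surface forms before indexing. Below is a minimal sketch, assuming you already have each token's surface string in hand (for example from `getSurface()` on a token). The class `NakaguroSplit` and the method `expand` are hypothetical names for illustration and are not part of Kuromoji:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Kuromoji: expands a token surface form
// into the terms to index, splitting on nakaguro.
class NakaguroSplit {
    // KATAKANA MIDDLE DOT (U+30FB), i.e. the nakaguro character.
    static final char NAKAGURO = '\u30FB';

    // Returns the original surface followed by its nakaguro-separated parts,
    // so both the full compound and each component can be indexed.
    static List<String> expand(String surface) {
        List<String> terms = new ArrayList<>();
        terms.add(surface);
        if (surface.indexOf(NAKAGURO) < 0) {
            return terms; // no nakaguro: index the surface as-is
        }
        for (String part : surface.split(String.valueOf(NAKAGURO))) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [コカ・コーラ, コカ, コーラ]
        System.out.println(expand("コカ・コーラ"));
    }
}
```

Indexing the components alongside the full surface keeps exact matches on コカ・コーラ working while also letting コーラ match. Whether this is preferable to a custom dictionary entry depends on how your index handles multiple terms at the same position.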