The Kuromoji project supports several dictionaries, but kuromoji_tokenizer currently defaults to IPADic and only lets us append entries to it. The problem with IPADic is that it hasn't been updated since 2007. There is reason to believe that UniDic is a better choice nowadays, since it is still being updated, and in my experiments UniDic does a better job of tokenizing words in the domain I'm working in. While kuromoji_tokenizer does allow me to specify a custom dictionary CSV file, I don't want to create, maintain, and deploy my own CSV dictionary; I'd prefer to use an existing, well-understood dictionary.
It would be great to be able to select the base dictionary in the configuration of the kuromoji_tokenizer, for example {"base_dictionary": "unidic"}, with accepted values such as "ipadic", "jumandic", "unidic", etc.
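To make the request concrete, here is a minimal sketch of what index settings using such an option might look like. The base_dictionary key and its accepted values are the proposal itself and do not exist in the plugin today; the tokenizer and analyzer names (my_tokenizer, my_analyzer) and the helper function are purely illustrative.

```python
import json

# Values proposed in this request; "base_dictionary" is NOT an existing
# kuromoji_tokenizer setting, so this set is an assumption.
PROPOSED_BASE_DICTIONARIES = {"ipadic", "jumandic", "unidic"}


def kuromoji_settings(base_dictionary: str = "ipadic") -> dict:
    """Build index settings with the hypothetical base_dictionary option."""
    if base_dictionary not in PROPOSED_BASE_DICTIONARIES:
        raise ValueError(f"unsupported base dictionary: {base_dictionary!r}")
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        # "my_tokenizer" is an illustrative name.
                        "my_tokenizer": {
                            "type": "kuromoji_tokenizer",
                            # Proposed option, defaulting to today's behavior.
                            "base_dictionary": base_dictionary,
                        }
                    },
                    "analyzer": {
                        # "my_analyzer" is an illustrative name.
                        "my_analyzer": {
                            "type": "custom",
                            "tokenizer": "my_tokenizer",
                        }
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(kuromoji_settings("unidic"), indent=2))
```

Defaulting to "ipadic" would keep existing configurations working unchanged, and rejecting unknown names early would surface typos at index creation time rather than at analysis time.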