elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

allow using any of Kuromoji's included dictionaries #57473

Open crclark opened 4 years ago

crclark commented 4 years ago

The Kuromoji project supports several dictionaries, but kuromoji_tokenizer currently defaults to IPADic and only allows us to append entries to it. The problem with IPADic is that it hasn't been updated since 2007, and there is reason to believe that UniDic is a better choice nowadays, since it is still being updated. Furthermore, in my experiments, UniDic does a better job of tokenizing words in the domain I'm working in. kuromoji_tokenizer does let me specify a custom dictionary CSV file, but I don't want to create and maintain my own CSV dictionary and worry about deploying it; I'd prefer to use an existing, well-understood dictionary.

It would be great to be able to specify {"base_dictionary": ["ipadic" | "jumandic" | "unidic" | etc.]} in the configuration of the kuromoji_tokenizer.
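
To make the request concrete, here is a rough sketch of how index settings might look with such an option. Note that base_dictionary is the setting proposed in this issue and does not exist in the plugin today; the names ja_tokenizer and ja_analyzer, the file name userdict_ja.txt, and the helper kuromoji_settings are placeholders chosen for illustration. Only type, mode, and user_dictionary are existing kuromoji_tokenizer options.

```python
import json

# Dictionary names taken from the request above; the set itself is illustrative.
PROPOSED_BASE_DICTIONARIES = {"ipadic", "jumandic", "unidic"}


def kuromoji_settings(base_dictionary=None, user_dictionary=None):
    """Build an index-settings body that defines a custom kuromoji_tokenizer."""
    tokenizer = {"type": "kuromoji_tokenizer", "mode": "search"}
    if user_dictionary is not None:
        # Existing option: a file of extra entries appended to the default dictionary.
        tokenizer["user_dictionary"] = user_dictionary
    if base_dictionary is not None:
        # HYPOTHETICAL option proposed in this issue; not a real setting today.
        if base_dictionary not in PROPOSED_BASE_DICTIONARIES:
            raise ValueError(f"unknown base dictionary: {base_dictionary!r}")
        tokenizer["base_dictionary"] = base_dictionary
    return {
        "settings": {
            "analysis": {
                # "ja_tokenizer" and "ja_analyzer" are placeholder names.
                "tokenizer": {"ja_tokenizer": tokenizer},
                "analyzer": {
                    "ja_analyzer": {"type": "custom", "tokenizer": "ja_tokenizer"}
                },
            }
        }
    }


if __name__ == "__main__":
    # What is possible today: append entries from a user dictionary file.
    print(json.dumps(kuromoji_settings(user_dictionary="userdict_ja.txt"), indent=2))
    # What this issue asks for: pick one of the bundled dictionaries by name.
    print(json.dumps(kuromoji_settings(base_dictionary="unidic"), indent=2))
```

The resulting dictionaries are shaped like the body of an index-creation request. The second one would only be accepted if something like base_dictionary were actually implemented.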

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Analysis)

tatsuya commented 3 years ago

+1

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)