atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Longer string in Katakana has low priority #115

Open oharato opened 7 years ago

oharato commented 7 years ago

Tested with kuromoji-core-1.0-SNAPSHOT and kuromoji-ipadic-1.0-SNAPSHOT (built from master on 2017/3/8).

With this user dictionary:

くろも,くろも,くろも,カスタム名詞
ろ,ろ,ろ,カスタム名詞

the string "くろもじ" is tokenized into:

くろも カスタム名詞,*,*,*,*,*,*,くろも,*
じ 助動詞,*,*,*,不変化型,基本形,じ,ジ,ジ

which is fine.

With the same entries written in Katakana:

クロモ,クロモ,クロモ,カスタム名詞
ロ,ロ,ロ,カスタム名詞

the string "クロモジ" is tokenized into:

ク 名詞,一般,*,*,*,*,ク,ク,ク
ロ カスタム名詞,*,*,*,*,*,*,ロ,*
モ *,*,*,*,*,*,*,*,*
ジ *,*,*,*,*,*,*,*,*

which is not fine.

I expected the following:

クロモ カスタム名詞,*,*,*,*,*,*,クロモ,*
ジ *,*,*,*,*,*,*,*,*
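
To make the expectation concrete, here is a minimal sketch of the segmentation I have in mind: at each position, prefer the longest matching user dictionary entry. This is not how kuromoji works internally (it builds a lattice and selects a lowest-cost path), and `LongestMatchSketch` and `segment` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchSketch {
  // Greedy longest-match segmentation against a set of user dictionary surfaces.
  // Characters not covered by any entry are emitted as single-character tokens.
  public static List<String> segment(String text, Set<String> entries) {
    List<String> tokens = new ArrayList<>();
    int pos = 0;
    while (pos < text.length()) {
      int end = pos + 1; // fallback: one character when no entry matches
      // Try the longest candidate first and stop at the first hit.
      for (int candidate = text.length(); candidate > pos; candidate--) {
        if (entries.contains(text.substring(pos, candidate))) {
          end = candidate;
          break;
        }
      }
      tokens.add(text.substring(pos, end));
      pos = end;
    }
    return tokens;
  }

  public static void main(String[] args) {
    Set<String> katakana = new HashSet<>(Arrays.asList("クロモ", "ロ"));
    Set<String> hiragana = new HashSet<>(Arrays.asList("くろも", "ろ"));
    System.out.println(segment("クロモジ", katakana)); // [クロモ, ジ]
    System.out.println(segment("くろもじ", hiragana)); // [くろも, じ]
  }
}
```

Under this rule the Katakana and Hiragana inputs split the same way, which is the behavior I expected from the tokenizer.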

What should I do to get the expected result?

Sample code I used:

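import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class UserDictionaryExample {
  public static void main(String[] args) {
    // Hiragana case (tokenized as expected); swap with the two lines below to reproduce it:
    // String target = "くろもじ";
    // List<String> dictionaryList = Arrays.asList("くろも,くろも,くろも,カスタム名詞", "ろ,ろ,ろ,カスタム名詞");
    // Katakana case (unexpected result):
    String target = "クロモジ";
    List<String> dictionaryList = Arrays.asList("クロモ,クロモ,クロモ,カスタム名詞", "ロ,ロ,ロ,カスタム名詞");
    String dictionary = String.join(System.lineSeparator(), dictionaryList);
    Tokenizer.Builder builder = new Tokenizer.Builder();
    // Load the user dictionary from an in-memory stream.
    try (InputStream inputStream = new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8))) {
      builder.userDictionary(inputStream);
    } catch (Exception e) {
      // Report the failure instead of silently ignoring it.
      e.printStackTrace();
    }
    Tokenizer tokenizer = builder.build();
    List<Token> tokens = tokenizer.tokenize(target);
    // Print each token's surface form and its features, tab-separated.
    tokens.forEach(token -> System.out.println(token.getSurface() + "\t" + token.getAllFeatures()));
  }
}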