apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Handle punctuation characters in KoreanTokenizer [LUCENE-8977] #10020

asfimport opened this issue 4 years ago (status: Open)

asfimport commented 4 years ago

As we discussed on #10009, when there is a run of consecutive punctuation marks, KoreanTokenizer now always splits it into the first character and the rest (사이즈.... => [사이즈] [.] [...]). However, it does not split the run when the first character of the input is punctuation (...사이즈 => [...] [사이즈]).

This looks like a consequence of the Viterbi path, but users may find the following cases strange ("사이즈" means "size" in Korean):

|        | Case #1                 | Case #2                            |
|--------|-------------------------|------------------------------------|
| Input  | "...사이즈..."          | "...4......4사이즈"                |
| Result | [...] [사이즈] [.] [..] | [...] [4] [.] [.....] [4] [사이즈] |

From what I checked, Nori has punctuation characters (like . and ,) in its dictionary, but Kuromoji does not ("サイズ" means "size" in Japanese):

|        | Case #1              | Case #2                         |
|--------|----------------------|---------------------------------|
| Input  | "...サイズ..."       | "...4......4サイズ"             |
| Result | [...] [サイズ] [...] | [...] [4] [......] [4] [サイズ] |
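
For reference, the two behaviors can be compared side by side with a small standalone program along these lines. This is only a sketch (the class name and structure are mine, not from the original report); it assumes the Lucene nori and kuromoji analysis modules are on the classpath and uses a null user dictionary.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.KoreanTokenizer.DecompoundMode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PunctuationRepro {

  public static void main(String[] args) throws Exception {
    // Nori: no user dictionary, no decompounding, keep punctuation (discardPunctuation = false)
    print(new KoreanTokenizer(TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
        null, DecompoundMode.NONE, false, false), "...사이즈...");
    // Kuromoji: no user dictionary, keep punctuation, NORMAL segmentation mode
    print(new JapaneseTokenizer(TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY,
        null, false, JapaneseTokenizer.Mode.NORMAL), "...サイズ...");
  }

  // Tokenize the input and print the surface form of each token in brackets.
  static void print(Tokenizer tokenizer, String input) throws Exception {
    tokenizer.setReader(new StringReader(input));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    StringBuilder out = new StringBuilder(input).append(" =>");
    while (tokenizer.incrementToken()) {
      out.append(" [").append(term).append(']');
    }
    tokenizer.end();
    tokenizer.close();
    System.out.println(out);
  }
}
```

Running it on the Case #1 inputs should show the splits reported in the tables above: Nori keeps the punctuation tokens but splits the trailing run, while Kuromoji keeps each run whole.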

There are some ways to resolve this, such as hard-coding special handling for punctuation, but that does not seem like a good approach. So I think we need to discuss it.


Migrated from LUCENE-8977 by Namgyu Kim (@danmuzi), updated Sep 18 2019

asfimport commented 4 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?

asfimport commented 4 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

Sorry for the late reply, @jimczi :(

First, I'll change this issue from Bug to Improvement, because it is debatable whether it counts as a bug.

> I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?

As you said, the main motivation is KoreanNumberFilter. However, users can also simply use the discardPunctuation option of KoreanTokenizer directly, without KoreanNumberFilter:

```java
Analyzer myAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // The last argument is discardPunctuation = false, so punctuation tokens are kept.
    Tokenizer tokenizer = new KoreanTokenizer(newAttributeFactory(), userDictionary,
        DecompoundMode.NONE, false, false);
    return new TokenStreamComponents(tokenizer, tokenizer);
  }
};
```
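
To see the resulting tokens, the analyzer can be exercised with something like the following (my own sketch, not part of the original comment; it assumes the usual TokenStream and CharTermAttribute imports and the myAnalyzer instance defined above):

```java
// Print the tokens the analyzer produces for one of the inputs discussed here.
try (TokenStream ts = myAnalyzer.tokenStream("field", "...사이즈...")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.print("[" + term + "]");  // reported result: [...][사이즈][.][..]
  }
  ts.end();
  System.out.println();
}
```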

When discardPunctuation is set to false, users may find the following result strange (at least I do):

Input    : ...사이즈...
Expect 1 : [.][..][사이즈][.][..]
Expect 2 : [...][사이즈][...]
Result   : [...][사이즈][.][..]

What do you think about this?