Open asfimport opened 4 years ago
Jim Ferenczi (@jimczi) (migrated from JIRA)
I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?
Namgyu Kim (@danmuzi) (migrated from JIRA)
Sorry for the late reply, @jimczi :(
First, I'll change this issue from Bug to Improvement, because it is debatable whether it should be treated as a bug.
> I wonder why you think that this is an issue. Punctuation is removed by default, so this is only an issue if you want to use the Korean number filter?
As you said, the main use case is KoreanNumberFilter. However, users can also simply set the discardPunctuation option of KoreanTokenizer to false, without using KoreanNumberFilter:
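Analyzer myAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // newAttributeFactory() and userDictionary come from the surrounding test class.
    // Last two arguments: outputUnknownUnigrams = false, discardPunctuation = false.
    Tokenizer tokenizer = new KoreanTokenizer(newAttributeFactory(), userDictionary, DecompoundMode.NONE, false, false);
    return new TokenStreamComponents(tokenizer, tokenizer);
  }
};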
With discardPunctuation set to false, users may find the following result strange (at least I do):
Input : ...사이즈...
Expect1 : [.][..][사이즈][.][..]
Expect2 : [...][사이즈][...]
Result : [...][사이즈][.][..]
What do you think about this?
As we discussed in #10009, when a word is followed by consecutive punctuation marks, KoreanTokenizer currently always splits the run into the first mark and the rest (사이즈.... => [사이즈] [.] [...]). But KoreanTokenizer does not split the run when the first character of the input is punctuation (...사이즈 => [...] [사이즈]).
This looks like a consequence of the Viterbi path, but users may find this behavior odd. ("사이즈" means "size" in Korean.)
From what I checked, Nori has punctuation characters (such as . and ,) in its dictionary, but Kuromoji does not. ("サイズ" means "size" in Japanese.)
There are ways to resolve this, such as hard-coding the handling of punctuation, but that does not seem like a good approach. So I think we need to discuss it.
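For the sake of discussion, here is a rough plain-Java sketch of what a hard-coded workaround could look like: merging adjacent punctuation-only tokens after tokenization, so that leading and trailing runs come out the same way. This is not Lucene or Nori code. The class and method names are hypothetical, the punctuation check is a simplified assumption based on Unicode general categories, and it works on a plain list of strings, so it ignores token offsets and would also merge punctuation tokens that were separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or Nori.
final class PunctuationRunMerger {

    // Simplified assumption: a code point counts as punctuation if it is in
    // one of the Unicode punctuation categories.
    static boolean isPunctuationCodePoint(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // True if the token is non-empty and consists only of punctuation.
    static boolean isPunctuation(String token) {
        return !token.isEmpty()
            && token.codePoints().allMatch(PunctuationRunMerger::isPunctuationCodePoint);
    }

    // Concatenates each run of neighboring punctuation-only tokens into one token.
    static List<String> mergePunctuationRuns(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        for (String token : tokens) {
            int last = merged.size() - 1;
            if (last >= 0 && isPunctuation(token) && isPunctuation(merged.get(last))) {
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }
}
```

Applied to the current output [...][사이즈][.][..], mergePunctuationRuns would give [...][사이즈][...]. A real fix would more likely belong in the tokenizer or the dictionary rather than in a post-processing step like this.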
Migrated from LUCENE-8977 by Namgyu Kim (@danmuzi), updated Sep 18 2019