Tomoko Uchida (@mocobeta) (migrated from JIRA)
Personally, I usually set the "discardPunctuation" flag to false to avoid this kind of subtle situation.
As a possible solution, instead of the "discardPunctuation" flag we could add a token filter that discards all tokens composed only of punctuation characters after tokenization (just like a stop filter)? To me, this is a token filter's job rather than a tokenizer's...
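For illustration, a minimal sketch of such a filter, built on Lucene's FilteringTokenFilter; the class name is hypothetical and the per-character test only approximates the punctuation check the tokenizer uses internally:

```java
import java.io.IOException;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Hypothetical filter that discards tokens consisting only of punctuation. */
public final class PunctuationOnlyStopFilter extends FilteringTokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PunctuationOnlyStopFilter(TokenStream in) {
    super(in);
  }

  @Override
  protected boolean accept() throws IOException {
    // Keep the token as soon as one non-punctuation character is found,
    // so mixed entries like (株) survive.
    final char[] buf = termAtt.buffer();
    final int len = termAtt.length();
    for (int i = 0; i < len; i++) {
      if (!isPunctuation(buf[i])) {
        return true;
      }
    }
    return false; // punctuation-only token: discard it
  }

  // Approximation: treat Unicode punctuation and symbol categories as punctuation.
  private static boolean isPunctuation(char ch) {
    switch (Character.getType(ch)) {
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.MATH_SYMBOL:
      case Character.MODIFIER_SYMBOL:
      case Character.OTHER_SYMBOL:
        return true;
      default:
        return false;
    }
  }
}
```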
Jun Ohtani (@johtani) (migrated from JIRA)
IMO, we should remove the flag and have Kuromoji output punctuation characters (including tokens that start with punctuation characters).
Then we can handle those tokens with a token filter. I think we can use the part-of-speech token filter to remove such tokens.
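For illustration, a minimal sketch of that chain; the IPADIC 記号 (symbol) stop tags below are illustrative, not a vetted set:

```java
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class PunctuationByPosAnalyzer extends Analyzer {
  // Illustrative IPADIC 記号 subcategories; adjust to taste.
  private static final Set<String> STOP_TAGS =
      Set.of("記号-句点", "記号-読点", "記号-括弧開", "記号-括弧閉", "記号-一般");

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Keep punctuation at tokenization time...
    Tokenizer tokenizer = new JapaneseTokenizer(
        null, /* discardPunctuation */ false, JapaneseTokenizer.Mode.SEARCH);
    // ...then drop it by part of speech afterwards.
    TokenStream filtered = new JapanesePartOfSpeechStopFilter(tokenizer, STOP_TAGS);
    return new TokenStreamComponents(tokenizer, filtered);
  }
}
```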
Jim Ferenczi (@jimczi) (migrated from JIRA)
> I usually set the "discardPunctuation" flag to false to avoid this kind of subtle situation.
I thought that discardPunctuation set to false was relevant only in the context of the JapaneseNumberFilter.
The ICU tokenizer removes the punctuation, for instance, so I am not sure keeping it should be the default. `(株)` is kind of special since the parentheses are required, so it shouldn't need a breaking change to preserve this term in the Japanese tokenizer?
Michael McCandless (@mikemccand) (migrated from JIRA)
Not sure if it is related, but #10142 is another struggle with Kuromoji and punctuation.
Jun Ohtani (@johtani) (migrated from JIRA)
Not exactly related, but we discussed the discardPunctuation flag in https://issues.apache.org/jira/browse/SOLR-3524
Jun Ohtani (@johtani) (migrated from JIRA)
I counted 3 types of words in the IPADIC CSV files.
For no. 3, I just counted them because I was curious.
Reference: word list.
https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd
The 4 words that start with punctuation are: (社) (財) (有) (株)
All punctuation-only words are:
—— −− ──
Jun Ohtani (@johtani) (migrated from JIRA)
I also checked UniDic for punctuation characters, because I was working on https://github.com/apache/lucene-solr/pull/935.
Here is the word list.
https://gist.github.com/johtani/3769639bc24ebeab17ddcb1be039ba94
Jun Ohtani (@johtani) (migrated from JIRA)
I've made a pull request.
Kazuaki Hiraga (@hkazuakey) (migrated from JIRA)
Hello,
I think I've just remembered why @cmoen and I talked about this option.
If my memory is correct, the reason was *position* and *start/end offset*. For example, if we take the keyword 日本語と「記号」の話し and apply the tokenizer with discardPunctuation=true, the positions and offsets of the tokens will be the following:
Token | 日本語 | と | 記号 | の | 話し |
---|---|---|---|---|---|
Offset | 0,3 | 3,4 | 5,7 | 8,9 | 9,11 |
Position | 1 | 2 | 3 | 4 | 5 |
And the following are the results of tokenization using a char filter and a token filter, respectively.
Applying PatternReplaceCharFilterFactory to remove some of the punctuation before running the tokenizer:
```xml
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\/\・「」])" replacement=""/>
```
Token | 日本語 | と | 記号 | の | 話し |
---|---|---|---|---|---|
Offset | 0,3 | 3,5 | 5,8 | 8,9 | 9,11 |
Position | 1 | 2 | 3 | 4 | 5 |
Applying PatternReplaceFilterFactory after the tokenizer:
```xml
<filter class="solr.PatternReplaceFilterFactory" pattern="([\/\・「」])" replacement="" replace="all"/>
```
Token | 日本語 | と | 記号 | の | 話し |
---|---|---|---|---|---|
Offset | 0,3 | 3,4 | 5,7 | 8,9 | 9,11 |
Position | 1 | 2 | 4 | 6 | 7 |
I cannot remember what I wanted to do at the time, but the former result, which uses the char filter, looks reasonable :) I might have preferred the start/end offsets that the tokenizer with discardPunctuation=true generates, but since there is no good use case in my mind, I think removing this option is reasonable.
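For reference, a small harness (a sketch; the class name is mine) that prints each token's term, offsets, and position straight from JapaneseTokenizer, which is how tables like the ones above can be reproduced:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class PrintTokens {
  public static void main(String[] args) throws IOException {
    try (JapaneseTokenizer tokenizer = new JapaneseTokenizer(
        null, /* discardPunctuation */ true, JapaneseTokenizer.Mode.SEARCH)) {
      tokenizer.setReader(new StringReader("日本語と「記号」の話し"));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
      PositionIncrementAttribute posInc =
          tokenizer.addAttribute(PositionIncrementAttribute.class);
      tokenizer.reset();
      int position = 0; // positions start at 1, as in the tables above
      while (tokenizer.incrementToken()) {
        position += posInc.getPositionIncrement();
        System.out.printf("%s offset=%d,%d position=%d%n",
            term, offset.startOffset(), offset.endOffset(), position);
      }
      tokenizer.end();
    }
  }
}
```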
Jun Ohtani (@johtani) (migrated from JIRA)
Hi @hkazuakey,
Thanks for your comment!
I agree with you about using a char filter or a token filter. And it is good to know how we can handle punctuation characters and position increments.
I think we can merge my pull request now.
And we may discuss removing this option in a future version, because removing this option is a big breaking change.
This issue was first raised in Elasticsearch here.
The IPADIC dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation and other characters. For instance, the following entry:
```
(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ
```
can be found in the Noun.csv file.
Today, tokens that start with punctuation are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?
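A minimal sketch of that whole-token check (the per-character test is an approximation of the tokenizer's internal one):

```java
// Sketch: treat a token as discardable punctuation only when *every*
// character is punctuation, so entries like (株) are preserved.
static boolean isPunctuationToken(char[] buffer, int offset, int length) {
  for (int i = offset; i < offset + length; i++) {
    int type = Character.getType(buffer[i]);
    boolean punct = type == Character.DASH_PUNCTUATION
        || type == Character.START_PUNCTUATION
        || type == Character.END_PUNCTUATION
        || type == Character.CONNECTOR_PUNCTUATION
        || type == Character.OTHER_PUNCTUATION
        || type == Character.INITIAL_QUOTE_PUNCTUATION
        || type == Character.FINAL_QUOTE_PUNCTUATION
        || type == Character.MATH_SYMBOL
        || type == Character.MODIFIER_SYMBOL
        || type == Character.OTHER_SYMBOL;
    if (!punct) {
      return false; // found a normal character: keep the token
    }
  }
  return length > 0;
}
```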
Migrated from LUCENE-9390 by Jim Ferenczi (@jimczi), updated Jun 19 2020. Pull requests: https://github.com/apache/lucene-solr/pull/1577