
Kuromoji tokenizer discards tokens if they start with a punctuation character [LUCENE-9390] #10430

Open asfimport opened 4 years ago

asfimport commented 4 years ago

This issue was first raised in Elasticsearch here

The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation and other characters. For instance, the following entry:

(株),1285,1285,3690,名詞,一般,,,,,(株),カブシキガイシャ,カブシキガイシャ

can be found in the Noun.csv file.

Today, tokens that start with a punctuation character are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?
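For illustration, here is a minimal sketch (assuming the lucene-analyzers-kuromoji module; the exact segmentation depends on the bundled dictionary, and the input text is made up) of how a token such as (株) silently disappears when discardPunctuation is true:

import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DiscardPunctuationDemo {
  public static void main(String[] args) throws Exception {
    // discardPunctuation = true (the default): any token whose first character
    // is punctuation is dropped, including the dictionary entry "(株)".
    try (JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, true, Mode.SEARCH)) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.setReader(new StringReader("(株)テスト"));
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term); // "(株)" never shows up in the output
      }
      tokenizer.end();
    }
  }
}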


Migrated from LUCENE-9390 by Jim Ferenczi (@jimczi), updated Jun 19 2020. Pull requests: https://github.com/apache/lucene-solr/pull/1577

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Personally, I usually set the "discardPunctuation" flag to false to avoid such subtle situations.

As a possible solution, instead of the "discardPunctuation" flag, could we add a token filter that discards all tokens composed only of punctuation characters after tokenization (just like the stop filter)? To me, it is a token filter's job rather than a tokenizer's...
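A rough sketch of what such a filter might look like (hypothetical class name; the punctuation check is a simplified stand-in for the tokenizer's own punctuation test, and FilteringTokenFilter takes care of position increments for removed tokens):

import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical filter: drop tokens made up entirely of punctuation, but keep
// mixed tokens such as "(株)".
public final class PunctuationOnlyStopFilter extends FilteringTokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PunctuationOnlyStopFilter(TokenStream in) {
    super(in);
  }

  @Override
  protected boolean accept() {
    final char[] buffer = termAtt.buffer();
    final int length = termAtt.length();
    for (int i = 0; i < length; ) {
      final int cp = Character.codePointAt(buffer, i, length);
      if (!isPunctuation(cp)) {
        return true; // at least one non-punctuation character: keep the token
      }
      i += Character.charCount(cp);
    }
    return false; // punctuation only: discard, like a stop filter
  }

  // Simplified check; the tokenizer's own test covers more categories.
  private static boolean isPunctuation(int cp) {
    final int type = Character.getType(cp);
    return type == Character.DASH_PUNCTUATION
        || type == Character.START_PUNCTUATION
        || type == Character.END_PUNCTUATION
        || type == Character.CONNECTOR_PUNCTUATION
        || type == Character.OTHER_PUNCTUATION
        || type == Character.INITIAL_QUOTE_PUNCTUATION
        || type == Character.FINAL_QUOTE_PUNCTUATION
        || type == Character.MATH_SYMBOL
        || type == Character.OTHER_SYMBOL;
  }
}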

asfimport commented 4 years ago

Jun Ohtani (@johtani) (migrated from JIRA)

IMO, we should remove the flag and have Kuromoji output punctuation characters (including tokens that start with a punctuation character).

Then we can handle such tokens with a token filter. I think we can use the part-of-speech token filter to remove them.
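For example (a hedged sketch; the IPADIC stop tags below are illustrative, not a vetted list), the existing JapanesePartOfSpeechStopFilter could drop symbol tokens after the tokenizer emits them:

import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PosStopFilterDemo {
  public static void main(String[] args) throws Exception {
    // Keep punctuation in the tokenizer (discardPunctuation = false) ...
    JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, false, Mode.SEARCH);
    tokenizer.setReader(new StringReader("日本語と「記号」の話し"));

    // ... and drop symbol tokens afterwards, by part of speech.
    Set<String> stopTags = new HashSet<>();
    stopTags.add("記号-一般");   // general symbol
    stopTags.add("記号-括弧開"); // open bracket
    stopTags.add("記号-括弧閉"); // close bracket
    try (TokenStream stream = new JapanesePartOfSpeechStopFilter(tokenizer, stopTags)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        System.out.println(term); // 「 and 」 are removed by the filter, not the tokenizer
      }
      stream.end();
    }
  }
}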

asfimport commented 4 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

> I usually set the "discardPunctuation" flag to false to avoid such subtle situations.

I thought that discardPunctuation set to false was relevant only in the context of the JapaneseNumberFilter.

The ICU tokenizer removes punctuation, for instance, so I am not sure it should be the default. *(株)* is kind of special since the parentheses are required, so it shouldn't need a breaking change to preserve this term in the Japanese tokenizer?

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Not sure if it is related, but #10142 is another struggle with Kuromoji and punctuation.

asfimport commented 4 years ago

Jun Ohtani (@johtani) (migrated from JIRA)

Not exactly related, but we discussed the discardPunctuation flag in https://issues.apache.org/jira/browse/SOLR-3524

asfimport commented 4 years ago

Jun Ohtani (@johtani) (migrated from JIRA)

I counted 3 types of words in the IPADIC CSV files.

  1. words that start with a punctuation character: 101 words, only 4 of which have length > 1
  2. words composed entirely of punctuation characters: 3 words
  3. words that contain punctuation somewhere after the first character: 723 words

For no. 3, I just counted it because I was curious.

Reference: word list.

 https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd

The 4 words that start with punctuation are: (社) (財) (有) (株)

The all-punctuation words are:

—— −− ──
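For reference, a rough sketch of one way to reproduce such counts (the CSV path is a placeholder, the mecab-ipadic sources are typically EUC-JP so adjust the charset as needed, and \p{P}/\p{S} is a simplified stand-in for the tokenizer's punctuation test, so the exact numbers may differ):

import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CountPunctuationEntries {
  public static void main(String[] args) throws Exception {
    int startsWithPunct = 0, allPunct = 0, punctAfterFirstChar = 0;
    // args[0]: a dictionary CSV file such as Noun.csv (EUC-JP for mecab-ipadic)
    for (String line : Files.readAllLines(Paths.get(args[0]), Charset.forName("EUC-JP"))) {
      String surface = line.split(",", 2)[0]; // first CSV column: the surface form
      boolean firstIsPunct = surface.substring(0, 1).matches("[\\p{P}\\p{S}]");
      if (firstIsPunct) startsWithPunct++;
      if (surface.matches("[\\p{P}\\p{S}]+")) allPunct++;
      if (!firstIsPunct && surface.substring(1).matches(".*[\\p{P}\\p{S}].*")) punctAfterFirstChar++;
    }
    System.out.printf("starts with punctuation: %d / all punctuation: %d / punctuation after first char: %d%n",
        startsWithPunct, allPunct, punctAfterFirstChar);
  }
}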

asfimport commented 4 years ago

Jun Ohtani (@johtani) (migrated from JIRA)

I also checked UniDic for punctuation characters, because I was working on https://github.com/apache/lucene-solr/pull/935.

  1. words that start with a punctuation character: 606 words, 222 of which have length > 1
  2. words composed entirely of punctuation characters: 111 words
  3. words that contain punctuation somewhere after the first character: 1780 words

Here is the word list.

https://gist.github.com/johtani/3769639bc24ebeab17ddcb1be039ba94

asfimport commented 4 years ago

Jun Ohtani (@johtani) (migrated from JIRA)

I've made a pull request. 

https://github.com/apache/lucene-solr/pull/1577

asfimport commented 4 years ago

Kazuaki Hiraga (@hkazuakey) (migrated from JIRA)

Hello,

I think I have just remembered why @cmoen and I talked about this option.

If my memory is correct, the reason was *position* and *start/end offset*. For example, if we have the keyword 日本語と「記号」の話し and apply the tokenizer with discardPunctuation=true, the positions and offsets of the tokens will be the following:

Token    日本語  と    記号   の    話し
Offset   0,3    3,4   5,7   8,9   9,11
Position 1      2     3     4     5
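A minimal sketch (assuming the kuromoji module and standard Lucene token attributes) of how these offsets and positions can be printed:

import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class OffsetAndPositionDemo {
  public static void main(String[] args) throws Exception {
    try (JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, true, Mode.SEARCH)) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
      PositionIncrementAttribute posIncr = tokenizer.addAttribute(PositionIncrementAttribute.class);
      tokenizer.setReader(new StringReader("日本語と「記号」の話し"));
      tokenizer.reset();
      int position = 0;
      while (tokenizer.incrementToken()) {
        position += posIncr.getPositionIncrement();
        // Offsets still point into the original text even though 「 and 」 are discarded.
        System.out.printf("%s %d,%d pos=%d%n", term, offset.startOffset(), offset.endOffset(), position);
      }
      tokenizer.end();
    }
  }
}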


And the following are the results of tokenization using a char filter and a token filter.

Applying PatternReplaceCharFilterFactory to remove some of the punctuation before running the tokenizer:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\/\・「」])" replacement=""/>
Token    日本語  と    記号   の    話し
Offset   0,3    3,5   5,8   8,9   9,11
Position 1      2     3     4     5

Applying PatternReplaceFilterFactory after running the tokenizer:

<filter class="solr.PatternReplaceFilterFactory" pattern="([\/\・「」])" replacement="" replace="all"/>    
Token    日本語  と    記号   の    話し
Offset   0,3    3,4   5,7   8,9   9,11
Position 1      2     4     6     7


I cannot remember what I wanted to do at that time, but it seems that the former result, using the char filter, is reasonable :) I might have preferred the start/end offsets that the tokenizer with discardPunctuation=true generates, but since there's no good use case in my mind, I think removing this option is reasonable.

asfimport commented 4 years ago

Jun Ohtani (@johtani) (migrated from JIRA)

Hi @hkazuakey , 

Thanks for your comment!

I agree with you about using a char filter or token filter. And it is good to know how we can handle punctuation characters and position increments.

I think we can merge my pull request now.

And we may discuss removing this option in a future version, because removing it would be a big breaking change.