apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute [LUCENE-9088] #10130

Open asfimport opened 4 years ago

asfimport commented 4 years ago

According to the JapaneseNumberFilter javadocs, it uses the attribute values of the last token used to compose the normalized number, which can be wrong. While this is documented, it leads to a number of incompatibilities with other Japanese token filters.

For example, for the input text "2008 2009", the first number takes the PartOfSpeechAttribute of the last token used to compose it, which leads to the following output (some attributes left out...):


{
  "token" : "2008",
  "start_offset" : 0,
  "end_offset" : 4,
  "type" : "word",
  [...]
  "partOfSpeech" : "記号-空白",
  "partOfSpeech (en)" : "symbol-space"
  [...]
},
{
  "token" : " ",
  "start_offset" : 4,
  "end_offset" : 5,
  "type" : "word",
  [...]
  "partOfSpeech" : "記号-空白",
  "partOfSpeech (en)" : "symbol-space",
  [...]
},
{
  "token" : "2009",
  "start_offset" : 5,
  "end_offset" : 9,
  "type" : "word",
  ...
  "partOfSpeech" : "名詞-数",
  "partOfSpeech (en)" : "noun-numeric",
}

so that e.g. a following kuromoji_part_of_speech filter will eliminate the "2008" token erroneously tagged as "symbol-space".

Even without fixing the other token attributes, the POS attributes should IMHO be set to "noun-numeric", since that's what the filter is supposed to detect.
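
To make the effect concrete, here is a minimal stand-alone sketch in plain Java with no Lucene dependency. The `Token` class and the `dropByPos` helper are hypothetical stand-ins, not Lucene APIs; they only model a stop filter that removes tokens by part-of-speech tag, applied once to the tags reported above and once to the tags the proposed change would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model; none of these types are Lucene classes.
class PosTagDemo {

    static final String SYMBOL_SPACE = "記号-空白"; // symbol-space
    static final String NOUN_NUMERIC = "名詞-数";   // noun-numeric

    // Minimal stand-in for a token carrying a term and a part-of-speech tag.
    static final class Token {
        final String term;
        final String pos;

        Token(String term, String pos) {
            this.term = term;
            this.pos = pos;
        }
    }

    // Models a part-of-speech stop filter: drops every token tagged with stopTag
    // and returns the terms of the tokens that survive.
    static List<String> dropByPos(List<Token> tokens, String stopTag) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (!t.pos.equals(stopTag)) {
                kept.add(t.term);
            }
        }
        return kept;
    }

    // Tags as reported in this issue: "2008" carries the tag of the space token.
    static List<Token> reported() {
        return Arrays.asList(
            new Token("2008", SYMBOL_SPACE),
            new Token(" ", SYMBOL_SPACE),
            new Token("2009", NOUN_NUMERIC));
    }

    // Tags under the proposed change: every normalized number is noun-numeric.
    static List<Token> proposed() {
        return Arrays.asList(
            new Token("2008", NOUN_NUMERIC),
            new Token(" ", SYMBOL_SPACE),
            new Token("2009", NOUN_NUMERIC));
    }

    public static void main(String[] args) {
        System.out.println(dropByPos(reported(), SYMBOL_SPACE)); // [2009]
        System.out.println(dropByPos(proposed(), SYMBOL_SPACE)); // [2008, 2009]
    }
}
```

With the reported tags the filter drops "2008" together with the space; with "noun-numeric" on both numbers only the space is removed.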


Migrated from LUCENE-9088 by Christoph Büscher (@cbuescher), updated Dec 11 2019

asfimport commented 4 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I don't think this behavior is documented. The javadocs say:

 * Also notice that token attributes such as
 * {@link org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute},
 * {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute},
 * {@link org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute} and
 * {@link org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute} are left
 * unchanged and will inherit the values of the last token used to compose the normalized
 * number and can be wrong. Hence, for 10万 (10000), we will have
 * {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute}
 * set to マン. This is a known issue and is subject to a future improvement.
 * <p>

but that doesn't explain why we use the POS of the token following a grouped number. IMO this is a bug that we should fix in order to ensure that the POS stop filter can be used to remove the punctuation that was needed to detect the numbers.