Closed togatoga closed 5 months ago
This issue might be related to https://github.com/elastic/elasticsearch/issues/28838
I found older documentation stating that a synonym filter should not be placed after a stemmer filter.
Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries. Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here. Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms, e.g. asciifolding will only produce the folded version of the token. Others, e.g. multiplexer, word_delimiter_graph or ngram will throw an error.
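To make the quoted behavior concrete, here is a minimal toy model in Python. It is not Lucene or Elasticsearch code: `toy_stemmer`, `analyze`, `parse_synonyms`, and `expand` are hypothetical names, and the "stemmer" only strips a trailing "s". It shows why the synonym entries have to go through the same preceding filters as the text: otherwise the stemmed token never matches the rule.

```python
from typing import Callable, List, Set

# A token filter is modelled as a plain function over a list of terms.
TokenFilter = Callable[[List[str]], List[str]]


def toy_stemmer(tokens: List[str]) -> List[str]:
    """Toy stand-in for a stemmer: drop one trailing 's' from terms longer than 3 chars."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Lowercase, split on whitespace, then run the filters in order."""
    tokens = text.lower().split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


def parse_synonyms(lines: List[str], preceding: List[TokenFilter]) -> List[Set[str]]:
    """Run every synonym entry through the filters that precede the synonym filter."""
    groups = []
    for line in lines:
        group = set()
        for entry in line.split(","):
            terms = analyze(entry, preceding)
            if len(terms) == 1:  # single-term entries only, to keep the sketch short
                group.add(terms[0])
        groups.append(group)
    return groups


def expand(tokens: List[str], groups: List[Set[str]]) -> List[str]:
    """Replace a token by its whole synonym group (flattened; real filters stack positions)."""
    out = []
    for token in tokens:
        matches = [g for g in groups if token in g]
        out.extend(sorted(matches[0]) if matches else [token])
    return out


preceding = [toy_stemmer]
stream = analyze("fast cars", preceding)  # what reaches the synonym filter
with_chain = expand(stream, parse_synonyms(["cars, automobiles"], preceding))
without_chain = expand(stream, parse_synonyms(["cars, automobiles"], []))
print(with_chain)     # rule terms were stemmed too, so "car" matches
print(without_chain)  # rule terms stayed unstemmed, so nothing matches
```

In this model the rule only fires when the entries were parsed with the same chain, which is the reason the documentation gives for applying preceding filters to the synonym file.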
Thanks for the issue, looks very much like the description of the behavior in your third comment.
I will look into the behavior in more detail and give advice after that.
@eiennohito Thank you for your quick action. I'm very interested in developing and contributing to Sudachi/elasticsearch-sudachi. I'm looking into the behavior and the internals of the code too. Where should I ask questions about Sudachi's code? Slack?
I would like to add something that I observe on my end: I am developing a Lucene plugin based on the Sudachi tokenizer, but for Solr. I also experimented with the synonym filter (i.e. SynonymGraphFilter in Lucene). As suggested by @eiennohito, I added the synonym filter last in the filter chain of the analyzer. I played around with the following synonyms: 赤ちゃん,新生児,児
Unfortunately, when I use Solr's field analysis UI, the Morpheme instance is null at runtime for tokens of type SYNONYM. To stop the NPEs, I implemented null checks in my attribute classes that extend AttributeImpl, so that when reflectWith(AttributeReflector attributeReflector) is invoked, I do not get an NPE due to a null Morpheme. Instead, reflectWith returns n/a for the missing morpheme metadata (see the attached screenshots).
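The null check described here can be sketched roughly as follows. This is a Python model of the idea, not the actual Java attribute classes: `Morpheme`, `Token`, and `reflect` are hypothetical stand-ins, `None` plays the role of a null Morpheme, and the part-of-speech id is a made-up value.

```python
from dataclasses import dataclass
from typing import Dict, Optional

NOT_AVAILABLE = "n/a"


@dataclass
class Morpheme:
    """Hypothetical stand-in for the morpheme metadata attached to a token."""
    reading_form: str
    part_of_speech_id: int


@dataclass
class Token:
    term: str
    token_type: str
    morpheme: Optional[Morpheme] = None  # None models the null Morpheme of SYNONYM tokens


def reflect(token: Token) -> Dict[str, str]:
    """Report token metadata, falling back to 'n/a' when no morpheme is attached."""
    morpheme = token.morpheme
    if morpheme is None:
        # Tokens injected by the synonym filter carry no morpheme: do not dereference it.
        return {
            "term": token.term,
            "type": token.token_type,
            "readingForm": NOT_AVAILABLE,
            "partOfSpeechId": NOT_AVAILABLE,
        }
    return {
        "term": token.term,
        "type": token.token_type,
        "readingForm": morpheme.reading_form,
        "partOfSpeechId": str(morpheme.part_of_speech_id),
    }


tokenized = Token("赤ちゃん", "word", Morpheme("アカチャン", 7))  # POS id 7 is invented
injected = Token("新生児", "SYNONYM")
print(reflect(tokenized))
print(reflect(injected))
```

The point of the fallback is only that the analysis UI can still render a row for the synonym token instead of failing on the missing metadata.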
After reading this issue and https://github.com/elastic/elasticsearch/issues/28838, plus my own experience, I feel this is not Elasticsearch-specific. I think (of course, I may be wrong) this has to do with Lucene + the Sudachi library in general.
Screen#1
Screen#2
To add some information and refine my previous message.
So, after adding some error checking in the code of my own Solr Lucene Sudachi plugin filter and the child classes of AttributeImpl, I was able to achieve the following:
Cannot invoke "com.worksap.nlp.sudachi.Morpheme.partOfSpeechId()" because "morpheme" is null. In the current Elasticsearch Sudachi plugin, SudachiPartOfSpeechStopFilter.kt needs a null check to avoid the partOfSpeechId-related error.
SynonymGraphFilterFactory anywhere in my filter chain in Solr's schema.xml
A few screenshots to better demonstrate bullet point #1 (filter behavior parity with Lucene's built-in Kuromoji module):
Screenshot # 1
In the following screenshot you see the Solr field analysis screen when the default Kuromoji analyzer is in play; the synonyms are 赤ちゃん,新生児,児. I have added SynonymGraphFilterFactory 2nd in the filter chain, right after the tokenizer. Note that with Kuromoji in play, field analysis does not display any metadata for the terms of type SYNONYM.
Screenshot # 2
In the following screenshot you see the Solr field analysis screen when the Sudachi analyzer is in play; the synonyms are still the same: 赤ちゃん,新生児,児. Here as well, I have added SynonymGraphFilterFactory 2nd in the filter chain, right after the tokenizer. Note that I provided a Sudachi tokenizer to SynonymGraphFilterFactory in the Solr schema.xml, which is why Sudachi tokenized 新生児 into two tokens, 新生 and 児.
P.S. Although I mention Solr above while this repo is an Elasticsearch Sudachi plugin, that should not matter much: under the hood, both Solr and Elasticsearch rely on the same plugin/filter architecture provided by Lucene.
cc: @eiennohito ^ Please let me know your thoughts 🙇🏼♂️
@azagniotov Thank you for the information. As you commented, some of the problems were on the es-sudachi side, and we arrived at a correction similar to yours (#122).
@togatoga
As you mentioned, SynonymGraphFilter uses the filters before it to parse the synonym entries. This is Elasticsearch-level behavior that we need to be aware of.
The null pointer exception that occurred with synonym filter -> sudachi pos filter was a bug (fixed by #122). We can use Sudachi filters after the synonym filter. Note that the added synonyms do not have morpheme information and are not affected by Sudachi filters.
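As a rough illustration of that last note, here is a small Python model of a part-of-speech stop filter placed after a synonym filter. It is not the code from #122: `Token` and `pos_stop_filter` are hypothetical names, and the tokens and part-of-speech labels are hand-written for the example rather than real Sudachi output.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Token:
    term: str
    pos: Optional[str] = None  # None models a token without morpheme information


def pos_stop_filter(tokens: List[Token], stop_tags: Set[str]) -> List[Token]:
    """Drop tokens whose part of speech is in stop_tags; pass morpheme-less tokens through."""
    kept = []
    for token in tokens:
        if token.pos is None:
            # Synonyms added upstream have no morpheme, so the filter cannot judge them.
            kept.append(token)
        elif token.pos not in stop_tags:
            kept.append(token)
    return kept


# Hand-written stream as it might look after a synonym filter added 一杯 for 山ほど.
stream = [
    Token("山ほど", "adverb"),
    Token("一杯"),  # injected synonym, no morpheme
    Token("遊ん", "verb"),
    Token("だ", "auxiliary"),
]
result = [t.term for t in pos_stop_filter(stream, {"auxiliary"})]
print(result)
```

The injected synonym survives regardless of the stop tags, which matches the statement that added synonyms are not affected by Sudachi filters.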
Hi team,
Thank you for a great plugin. I am really surprised and satisfied with the quality of Sudachi.
I've encountered some weird results and I'm unsure whether it's a bug or intended behavior.
Issue
It seems that the configuration of the synonym filter is being influenced by sudachi_part_of_speech. In a previous issue, it was suggested that the synonym filter should be applied last. However, applying it last appears to affect other filters. Is this behavior intentional?
Environment
Configuration (elasticsearch/index.json):
Query and result
The phrases 山に行った and 山ほど遊んだ produced unexpected results: the synonym word 山ほど seems to be transformed into 山. (FYI, a Sudachi synonym dict contains 山ほど,一杯.)
Based on the query result, it appears that the order of the filters is affecting the outcome. While it was suggested to apply the synonym filter last, doing so seems to impact other filters. Is this behavior intentional, or is there a need for correction?
Here is the repository I used for this experiment. You can reproduce the issue with it.
If you swap the order of synonym_filter and sudachi_pos_filter, some queries (e.g. 山) result in a null_pointer_exception.
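For readers who want to picture the two orderings being compared, here is a hypothetical sketch that builds index settings as a Python dict. It is not the index.json from the repository: the analyzer name `sudachi_analyzer`, the inline synonym list, the `synonym_graph` type, and the stop tag 助詞 are all assumptions chosen for illustration. Only the filter names synonym_filter and sudachi_pos_filter come from this issue.

```python
import json
from typing import Dict, List


def build_settings(filter_order: List[str]) -> Dict:
    """Build illustrative index settings with the given token filter order."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        "sudachi_tokenizer": {"type": "sudachi_tokenizer"}
                    },
                    "filter": {
                        # Assumed: inline synonyms instead of a synonym dictionary.
                        "synonym_filter": {
                            "type": "synonym_graph",
                            "synonyms": ["山ほど,一杯"],
                        },
                        # Assumed stop tag; the real configuration may differ.
                        "sudachi_pos_filter": {
                            "type": "sudachi_part_of_speech",
                            "stoptags": ["助詞"],
                        },
                    },
                    "analyzer": {
                        "sudachi_analyzer": {
                            "type": "custom",
                            "tokenizer": "sudachi_tokenizer",
                            "filter": filter_order,
                        }
                    },
                }
            }
        }
    }


# Ordering A: part-of-speech filter first, synonym filter last.
synonym_last = build_settings(["sudachi_pos_filter", "synonym_filter"])
# Ordering B: synonym filter first, part-of-speech filter after it.
synonym_first = build_settings(["synonym_filter", "sudachi_pos_filter"])

print(json.dumps(synonym_last, ensure_ascii=False, indent=2))
```

With ordering A the part-of-speech filter is also applied when the synonym entries are parsed, which is one plausible reading of why 山ほど ends up as 山; ordering B is the one that triggered the null_pointer_exception before the fix.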