The synonym filter is being influenced by other filters

togatoga commented 1 year ago

Hi team,

Thank you for a great plugin. I'm really impressed by the quality of Sudachi.

I've encountered some weird results and I'm unsure whether it's a bug or intended behavior.

Issue

It seems that the configuration of the synonym filter is being influenced by sudachi_part_of_speech. In a previous issue, it was suggested that the synonym filter should be applied last. However, applying it last appears to affect other filters. Is this behavior intentional?

Environment

elasticsearch-8.8.1-analysis-sudachi-3.1.0

Configuration (elasticsearch/index.json):

PUT /sudachi-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sudachi_search_analyzer_c": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer_c",
          "discard_punctuation": true,
          "filter": [
            "sudachi_pos_filter",
            "synonym_filter"
          ]
        }
      },
      "tokenizer": {
        "sudachi_tokenizer_c": {
          "type": "sudachi_tokenizer",
          "split_mode": "C",
          "discard_punctuation": true
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "山ほど => 山程, たくさん, 一杯"
          ]
        },
        "sudachi_pos_filter": {
          "type": "sudachi_part_of_speech",
          "stoptags": [
            "代名詞",
            "形状詞-タリ",
            "形状詞-助動詞語幹",
            "連体詞",
            "接続詞",
            "感動詞",
            "助動詞",
            "助詞",
            "補助記号",
            "空白"
          ]
        }
      }
    }
  }
}

Query and result

GET /sudachi-test/_analyze
{
  "text": "山に行った",
  "analyzer": "sudachi_search_analyzer_c"
}

{
  "tokens": [
    {
      "token": "山程",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "たくさん",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "一",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "行っ",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "杯",
      "start_offset": 2,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 2
    }
  ]
}

GET /sudachi-test/_analyze
{
  "text": "山ほど遊んだ",
  "analyzer": "sudachi_search_analyzer_c"
}

{
  "tokens": [
    {
      "token": "山ほど",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "遊ん",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}

The phrases 山に行った and 山ほど遊んだ both produced unexpected results: the synonym entry 山ほど seems to have been reduced to 山. (FYI, the Sudachi synonym dictionary contains 山ほど,一杯.)

Based on the query result, it appears that the order of the filters is affecting the outcome. While it was suggested to apply the synonym filter last, doing so seems to impact other filters. Is this behavior intentional, or is there a need for correction?

Here is the repository I used for this experiment. You can use it to reproduce the issue.

If you swap the order of synonym_filter and sudachi_pos_filter, some queries, such as 山, result in a null_pointer_exception.

PUT /sudachi-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sudachi_search_analyzer_c": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer_c",
          "discard_punctuation": true,
          "filter": [
            "synonym_filter",
            "sudachi_pos_filter"
          ]
        }
      },
      "tokenizer": {
        "sudachi_tokenizer_c": {
          "type": "sudachi_tokenizer",
          "split_mode": "C",
          "discard_punctuation": true
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "山ほど => 山程, たくさん, 一杯"
          ]
        },
        "sudachi_pos_filter": {
          "type": "sudachi_part_of_speech",
          "stoptags": [
            "代名詞",
            "形状詞-タリ",
            "形状詞-助動詞語幹",
            "連体詞",
            "接続詞",
            "感動詞",
            "助動詞",
            "助詞",
            "補助記号",
            "空白"
          ]
        }
      }
    }
  }
}

GET /sudachi-test/_analyze
{
  "text": "山",
  "analyzer": "sudachi_search_analyzer_c"
}

{
  "error": {
    "root_cause": [
      {
        "type": "null_pointer_exception",
        "reason": """Cannot invoke "com.worksap.nlp.sudachi.Morpheme.partOfSpeechId()" because "morpheme" is null"""
      }
    ],
    "type": "null_pointer_exception",
    "reason": """Cannot invoke "com.worksap.nlp.sudachi.Morpheme.partOfSpeechId()" because "morpheme" is null"""
  },
  "status": 500
}

togatoga commented 1 year ago

This issue might be related to https://github.com/elastic/elasticsearch/issues/28838

togatoga commented 1 year ago

I found some older documentation indicating that we should not place a synonym filter after a stemmer filter:

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries. Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here. Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms, e.g. asciifolding will only produce the folded version of the token. Others, e.g. multiplexer, word_delimiter_graph or ngram will throw an error.
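The rule-parsing behavior quoted above can be illustrated with a toy model. The Python sketch below is only an illustration under stated assumptions: the segmentations and part-of-speech tags in TOY_SEGMENTATION are hypothetical values chosen to mirror the _analyze output earlier in this thread, not real Sudachi output, and parse_rule is a made-up helper rather than Elasticsearch or Lucene code.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (surface, coarse part-of-speech tag)

# ASSUMPTION: hypothetical segmentations standing in for the tokenizer.
# They are picked to match the tokens seen in the _analyze output, nothing more.
TOY_SEGMENTATION = {
    "山ほど": [("山", "名詞"), ("ほど", "助詞")],
    "山程": [("山程", "名詞")],
    "たくさん": [("たくさん", "副詞")],
    "一杯": [("一", "名詞"), ("杯", "名詞")],
}
STOPTAGS = {"助詞", "助動詞"}


def toy_tokenize(text: str) -> List[Token]:
    """Look up the toy segmentation; unknown text becomes a single noun token."""
    return TOY_SEGMENTATION.get(text, [(text, "名詞")])


def toy_pos_stop_filter(tokens: List[Token]) -> List[Token]:
    """Drop tokens whose tag is in STOPTAGS, like a part-of-speech stop filter."""
    return [t for t in tokens if t[1] not in STOPTAGS]


def parse_rule(rule: str, filters: List[Callable[[List[Token]], List[Token]]]):
    """Analyze both sides of an 'a => b, c' rule with the filters that
    precede the synonym filter in the chain (hypothetical helper)."""
    lhs, rhs = [side.strip() for side in rule.split("=>")]

    def analyze(entry: str) -> List[str]:
        tokens = toy_tokenize(entry)
        for f in filters:
            tokens = f(tokens)
        return [surface for surface, _ in tokens]

    return analyze(lhs), [analyze(e.strip()) for e in rhs.split(",")]


RULE = "山ほど => 山程, たくさん, 一杯"
lhs_filtered, rhs_filtered = parse_rule(RULE, [toy_pos_stop_filter])
lhs_plain, _ = parse_rule(RULE, [])
print(lhs_filtered, rhs_filtered)
print(lhs_plain)
```

Under these assumptions the left-hand side of the rule shrinks from 山ほど to 山 once the part-of-speech filter runs on it, which is consistent with 山に行った picking up the synonyms while 山ほど遊んだ did not.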

eiennohito commented 1 year ago

Thanks for the issue. This looks very much like the behavior described in your third comment.

I will look into the behavior in more detail and give advice after that.

togatoga commented 12 months ago

@eiennohito Thank you for your quick action. I'm very interested in developing and contributing to Sudachi/elasticsearch-sudachi. I'm also looking into the behavior and the internals of the code. Where should I ask questions about Sudachi's code? Slack?

azagniotov commented 9 months ago

I would like to add something that I have observed on my end: I am developing a Lucene plugin based on the Sudachi tokenizer, but for Solr. I also experimented with a synonym filter (i.e., SynonymGraphFilter in Lucene). As suggested by @eiennohito, I added the synonym filter last in the analyzer's filter chain. I played around with the following synonyms: 赤ちゃん,新生児,児.

Unfortunately, when I use Solr's field analysis UI, the Morpheme instance is null at runtime for tokens of type SYNONYM. To stop the NPEs, I added null checks to my attribute classes that extend AttributeImpl, so that invoking reflectWith(AttributeReflector attributeReflector) no longer fails on a null Morpheme. Instead, reflectWith returns n/a for the missing morpheme metadata (see the attached screenshots).
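A minimal Python sketch of that kind of null-safe reflection, for illustration only. The real attribute classes extend Lucene's AttributeImpl on the JVM; the Morpheme dataclass, the MorphemeAttribute class, and the field names below are hypothetical stand-ins, not the Sudachi or Lucene API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Morpheme:
    """Hypothetical stand-in for the morpheme attached to a tokenizer-produced token."""
    surface: str
    part_of_speech: str


class MorphemeAttribute:
    """Holds an optional morpheme; synonym-injected tokens have none."""

    def __init__(self, morpheme: Optional[Morpheme] = None) -> None:
        self.morpheme = morpheme

    def reflect_with(self, reflector: Callable[[str, str], None]) -> None:
        # Report "n/a" instead of dereferencing a missing morpheme.
        m = self.morpheme
        reflector("surface", m.surface if m is not None else "n/a")
        reflector("partOfSpeech", m.part_of_speech if m is not None else "n/a")


def reflect_to_dict(attr: MorphemeAttribute) -> Dict[str, str]:
    """Collect the reflected key/value pairs, as an analysis UI might."""
    out: Dict[str, str] = {}
    attr.reflect_with(lambda key, value: out.__setitem__(key, value))
    return out


word_fields = reflect_to_dict(MorphemeAttribute(Morpheme("赤ちゃん", "名詞")))
synonym_fields = reflect_to_dict(MorphemeAttribute())  # no morpheme attached
print(word_fields)
print(synonym_fields)
```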

After reading this issue and https://github.com/elastic/elasticsearch/issues/28838, and based on my own experience, I feel this is not Elasticsearch-specific. I think (of course, I may be wrong) it has to do with Lucene and the Sudachi library in general.

Screen#1

Screen#2

azagniotov commented 9 months ago

To add some information and refine my previous message:

After adding some error checking to the filter in my own Solr Lucene Sudachi plugin and to its child classes of AttributeImpl, I was able to achieve the following:

1. Filter behavior parity with Lucene's built-in Kuromoji module.
2. No more exceptions of the form: Cannot invoke "com.worksap.nlp.sudachi.Morpheme.partOfSpeechId()" because "morpheme" is null. In the current Elasticsearch Sudachi plugin, SudachiPartOfSpeechStopFilter.kt needs a null check to avoid the partOfSpeechId-related error.
3. I can configure SynonymGraphFilterFactory anywhere in my filter chain in Solr's schema.xml.

A few screenshots to better demonstrate point 1 (filter behavior parity with Lucene's built-in Kuromoji module):

Screenshot #1: The following screenshot shows the Solr field analysis screen with the default Kuromoji analyzer; the synonyms are 赤ちゃん,新生児,児. I added SynonymGraphFilterFactory second in the filter chain, right after the tokenizer. Note that with Kuromoji, field analysis does not display any metadata for terms of type SYNONYM.

Screenshot #2: The following screenshot shows the Solr field analysis screen with the Sudachi analyzer; the synonyms are the same, 赤ちゃん,新生児,児. Here as well, I added SynonymGraphFilterFactory second in the filter chain, right after the tokenizer. Note that I supplied a Sudachi tokenizer to SynonymGraphFilterFactory in the Solr schema.xml, which is why Sudachi tokenized 新生児 into two tokens, 新生 and 児.

P.S. Although I mention Solr above and this repo is an Elasticsearch Sudachi plugin, this should not matter much: under the hood, both Solr and Elasticsearch use the same plugin/filter architecture provided by Lucene.

azagniotov commented 9 months ago

cc: @eiennohito ^ Please let me know your thoughts 🙇🏼‍♂️

mh-northlander commented 5 months ago

@azagniotov Thank you for the information. As you commented, some of the problems were on the es-sudachi side, and we arrived at a correction similar to yours (#122).

@togatoga As you mentioned, SynonymGraphFilter uses the filters before it to parse the synonym entries. This is Elasticsearch-level behavior that we need to be aware of. The null pointer exception that occurred with the order synonym filter -> sudachi pos filter was a bug (fixed by #122). Sudachi filters can now be used after the synonym filter. Note that the added synonyms do not have morpheme information and are not affected by sudachi filters.
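The last point can be sketched in Python as follows. This is not the code from #122; SketchToken and sketch_pos_stop_filter are hypothetical names, and a pos value of None is an assumed way of modeling a synonym token that carries no morpheme.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass
class SketchToken:
    term: str
    pos: Optional[str]  # None models a synonym-injected token with no morpheme


def sketch_pos_stop_filter(
    tokens: Iterable[SketchToken], stoptags: Set[str]
) -> Iterator[SketchToken]:
    """Drop tokens whose part of speech is in stoptags.

    Tokens without morpheme information are passed through unchanged
    instead of raising, so the filter can sit after a synonym filter.
    """
    for token in tokens:
        if token.pos is None:
            yield token  # nothing to inspect, so keep it
        elif token.pos not in stoptags:
            yield token


stream = [
    SketchToken("山程", None),    # added by the synonym filter
    SketchToken("に", "助詞"),    # particle, listed in stoptags
    SketchToken("行っ", "動詞"),  # verb, kept
]
kept_terms = [t.term for t in sketch_pos_stop_filter(stream, {"助詞", "助動詞"})]
print(kept_terms)
```

One consequence of this design is that a synonym cannot be removed by its part of speech; if that matters, the part-of-speech filter has to run before the synonym filter, with the rule-parsing side effect discussed earlier in the thread.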
