WorksApplications / elasticsearch-sudachi

The Japanese analysis plugin for elasticsearch
Apache License 2.0

Unable to Reproduce Example as Described in Documentation #116

Closed arcoyk closed 5 months ago

arcoyk commented 8 months ago

Hi,

Thank you for the great plugin. I cannot reproduce the example written in the official documentation.

Input:

{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}

Expected (as described in the documentation):

{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "美味しい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}

Actual (v3.1.0 release with OpenSearch 2.6.0):

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 3,
            "position": 1,
            "start_offset": 2,
            "token": "が",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "美味しい",
            "type": "word"
        }
    ]
}

Compared with the documented output, the particle が is not removed. If you can think of any possible causes, please leave a comment. I appreciate your assistance.

kazuma-t commented 8 months ago

I assume you applied the sudachi_part_of_speech setting in the README, but I could not reproduce your results here. Please let us know your configuration file and the dictionaries you are using.
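
For comparison, the documented output (が removed, おいしい normalized to 美味しい) would come from an analyzer along these lines. This is only a sketch, and the exact filter settings in the README may differ:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "sudachi_part_of_speech",
              "sudachi_ja_stop",
              "sudachi_normalizedform"
            ]
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C"
          }
        }
      }
    }
  }
}

If sudachi_part_of_speech is configured (via stoptags) to drop particles, が disappears, and sudachi_normalizedform maps おいしい to 美味しい.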

arcoyk commented 8 months ago

We are using the full dictionary and the base configuration.

My apologies, the example provided above is not quite accurate. I think the problem is actually related to a change in how baseform, readingform, and normalizedform are being applied.

I tried to put together a minimal reproducible example:

{
  "settings": {
    "index": {
      "analysis" : {
        "analyzer" : {
          "sudachi_analyzer" : {
            "filter" : [
              "sudachi_ja_stop",
              "sudachi_baseform"
            ],
            "type" : "custom",
            "tokenizer" : "sudachi_tokenizer"
          }
        },
        "tokenizer" : {
          "sudachi_tokenizer" : {
            "type" : "sudachi_tokenizer"
          }
        }
      }
    }
  }
}

If we analyze this text,

{
  "analyzer": "sudachi_analyzer",
  "text": "および"
}

I expect to get no tokens, because および is defined in stopwords.txt and should be removed by the sudachi_ja_stop filter. However, I still get one token:

{
  "tokens" : [
    {
      "token" : "および",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}

This is with v3.1.0 running on OpenSearch 2.6.0. I have confirmed that with our old version, v2.1.0 running on Elasticsearch 7.10.2, the expected behavior occurs (no tokens).

I see from the changelog that in v3.0.0 there was a change related to how analysis chains are processed. Is this a side effect of that?
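
In case it helps with reproduction, the same _analyze request with "explain": true should show the tokens after the tokenizer and after each token filter, so it is possible to see whether sudachi_ja_stop ever removes および:

{
  "analyzer": "sudachi_analyzer",
  "text": "および",
  "explain": true
}

The response then contains a detail section with one entry per filter in the chain.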

arcoyk commented 8 months ago

On further investigation, this is the same issue as #111. Please close this issue if you feel it is necessary, but it would be nice to see #111 resolved.