Korean (nori) Analysis Synonym Filter build failed

AnSungHyun commented 5 years ago

Index creation fails when a synonym filter is configured together with the Korean (nori) analysis plugin.

Elasticsearch version (bin/elasticsearch --version): 6.5.3

Plugins installed: [ analysis-nori ]

JVM version (java -version): java version "1.8.0_121"

OS version (uname -a if on a Unix-like system): Linux search 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Steps to reproduce:

1. Install the Korean (nori) analysis plugin: bin/elasticsearch-plugin install analysis-nori

2. Create the index with the following settings:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym"
            ]
          }
        }
      }
    }
  }
}

3. The request fails with this error message:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[node-test][192.168.0.1:9300][indices:admin/create]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "parse_exception: Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: 풋사과 analyzed to a token (풋) with position increment != 1 (got: 0)"
      }
    }
  },
  "status": 400
}

4. I also tried the synonym_graph filter, but the problem was not resolved. Index creation request:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym_graph"
            ]
          }
        }
      }
    }
  }
}

5. Token analysis result after removing the synonym filter from the analyzer. Index creation request:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

Analyze request:

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "풋사과"
}

Result:

{
  "tokens" : [
    {
      "token" : "풋사과",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "풋",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "사과",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}

"풋사과" is a compound word. Is it not possible to use compound words in synonyms?

elasticmachine commented 5 years ago

Pinging @elastic/es-search

jimczi commented 5 years ago

Thanks for reporting this problem, @AnSungHyun. This is similar to https://github.com/elastic/elasticsearch/pull/34331, except that it occurs in a tokenizer. The synonym filter checks that each input synonym can be analyzed to a single form and fails to build if not. Since the mixed mode of the Korean tokenizer preserves both the compound and the split form, it is currently not possible to add a compound word to a synonym dictionary. I discussed this with @romseygeek offline, and we think it is possible to add the same workaround as in #34331 for tokenizers. This would allow us to change the tokenizer option when we build the synonym map. In this case we'd change the mixed mode to discard (which removes the compound) to make it compatible with synonym building.

jimczi commented 5 years ago

I forgot that the output should also contain the compound and the decompounded forms of the expanded synonyms. Unfortunately this is not possible in the synonym filter, so the solution proposed above wouldn't work. Another possibility is to extract the decompounding into a separate token filter instead of doing it in the tokenizer. This way it would be possible to place the synonym filter before the decompounding filter, and the tokenizer would always output a single path.

seohoryu commented 5 years ago

@AnSungHyun I am not sure this is the right way to solve this issue, but I believe it can serve as a workaround for you. In my case, I registered "대한민국,한국,코리아" and ran into the same issue as you. "대한민국" is a compound word, so it produces exactly the same error. However, after I added "대한민국" to the user dictionary, the error went away.

Here are my settings.

PUT test
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": ["synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          }
        }
      }
    }
  },
...
}

I then added "대한민국" to userdict_ko.txt.

I hope this is helpful for you.

jimczi commented 3 years ago

I am closing this issue as won't fix for now. Using the mixed mode of the nori tokenizer doesn't work with multi-word synonyms, but this is a broader problem. The solution for now is to use the discard mode to ensure that a single path is produced.
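For anyone landing here, the discard-mode suggestion could look roughly like the request below. This is a minimal sketch, not a verified fix: it reuses the index, tokenizer, filter, and analyzer names from the original report and changes only decompound_mode from mixed to discard. Nobody in this thread confirmed that the index is created successfully with these exact settings on 6.5.3.

```json
# Sketch only: same names as in the original report, decompound_mode switched to "discard"
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "discard"
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": [
              "풋사과,햇사과"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": [
              "synonym"
            ]
          }
        }
      }
    }
  }
}
```

The trade-off is that the compound form is no longer emitted: running the _analyze request from the report against this index should return only the decompounded tokens, without the compound token that mixed mode adds at the same position.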

elastic / elasticsearch

Korean (nori) Analysis Synonym Filter build failed #37751