elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Korean tokenizer (Nori) doesn't split digits and letters #46365

Closed drakejin closed 4 years ago

drakejin commented 5 years ago

Describe the feature

Elasticsearch version (bin/elasticsearch --version):

6.7.2

Plugins installed: []

JVM version (java -version):

jvm 1.8

OS version (uname -a if on a Unix-like system):

ubuntu 16.04

Description of the problem including expected versus actual behavior:

I wanted to analyze 44사이즈비키니 (44-size bikini :) ), so I executed the following requests.

Steps to reproduce

  1. I checked the nouns 사이즈 (size) and 비키니 (bikini) and saw that they are tagged NNP and NNG.
  2. So I combined these words into '44사이즈비키니' and sent it to be analyzed with the Nori plugin.

Elastic Settings And Analysis Result

Mapping Result

{
  "articles-alpha" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "articles-alpha",
        "creation_date" : "1567669131498",
        "analysis" : {
          "analyzer" : {
            "korean" : {
              "filter" : [
                "lowercase",
              ],
              "type" : "custom",
              "tokenizer" : "nori_user_dict_tokenizer"
            }
          },
          "tokenizer" : {
            "nori_user_dict_tokenizer" : {
              "mode" : "mixed",
              "type" : "nori_tokenizer",
              "user_dictionary" : "nori/dict-service-noun"
            }
          }
        },
        "number_of_replicas" : "1"
      }
    }
  }
}
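
As context for the settings above: a Nori user_dictionary file (here nori/dict-service-noun) lists one custom noun per line, and an entry may optionally be followed by its segmentation, separated by spaces. The contents below are hypothetical, for illustration only:

```text
사이즈비키니 사이즈 비키니
비키니
```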

input

GET /articles-alpha/_analyze
{
  "text": "44사이즈비키니",
  "analyzer": "korean",
  "explain": true
}

Output

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "nori_user_dict_tokenizer",
      "tokens" : [
        {
          "token" : "44사이즈비키니",
          "start_offset" : 0,
          "end_offset" : 8,
          "type" : "word",
          "position" : 0,
          "bytes" : "[34 34 ec 82 ac ec 9d b4 ec a6 88 eb b9 84 ed 82 a4 eb 8b 88]",
          "leftPOS" : "UNKNOWN(Unknown)",
          "morphemes" : null,
          "posType" : "MORPHEME",
          "positionLength" : 1,
          "reading" : null,
          "rightPOS" : "UNKNOWN(Unknown)",
          "termFrequency" : 1
        }
      ]
    }
  }
}
elasticmachine commented 5 years ago

Pinging @elastic/es-search

jimczi commented 5 years ago

Thanks for reporting @drake-jin . This is clearly a regression caused by https://issues.apache.org/jira/browse/LUCENE-8548. I opened https://issues.apache.org/jira/browse/LUCENE-8966 to fix this, since digits should not be grouped with other types of characters.

drakejin commented 5 years ago

@jimczi Could I ask which version will include the patch?

Thanks as always.

jimczi commented 5 years ago

Could I ask which version will include the patch?

The fix will be released in Lucene 8.3, so it should be available in a 7.x version of Elasticsearch. The earliest would be Elasticsearch 7.6, but there is no guarantee here.
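
You can check which Lucene version a given node bundles via the root endpoint; the response includes version.lucene_version (sketch, response abbreviated and values illustrative):

```
GET /

{
  "version" : {
    "number" : "7.6.0",
    "lucene_version" : "8.4.0",
    ...
  }
}
```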

drakejin commented 4 years ago

It also doesn't split English letters (alphabet characters) and digits...

PUT /test
{
  "settings" : {
    "number_of_shards" : "5",
    "analysis" : {
      "analyzer" : {
        "korean" : {
          "type" : "custom",
          "tokenizer" : "nori_user_dict_tokenizer"
        }
      },
      "tokenizer" : {
        "nori_user_dict_tokenizer" : {
          "mode" : "mixed",
          "type" : "nori_tokenizer"
        }
      }
    }
  }
}
GET /test/_analyze
{
  "text": ["foo3", "Foo3", "FOO3"],
  "tokenizer": "nori_user_dict_tokenizer"
}

[screenshot: the _analyze output shows "foo3", "Foo3", and "FOO3" each kept as a single token]
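
To see why a token is kept whole, the same _analyze call can be run with explain enabled, which reports the POS attributes per token (leftPOS/rightPOS come back as UNKNOWN(Unknown) for the unsplit token, as in the earlier output):

```
GET /test/_analyze
{
  "tokenizer": "nori_user_dict_tokenizer",
  "text": "foo3",
  "explain": true
}
```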

drakejin commented 4 years ago

@jimczi

It seems this issue is resolved! You can close this issue. Good job.

# Tested on Elasticsearch version 7.7.1
curl -X POST "http://localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "nori",
  "text": "44사이즈비키니"
}
'
{
  "tokens" : [
    {
      "token" : "44",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "사이즈",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "비키니",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    }
  ]
}
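
With the fix, a query for just one of the segments can now match documents containing the compound, since 사이즈 and 비키니 are indexed as separate tokens. A sketch, assuming a hypothetical title field analyzed with the korean analyzer from the settings above:

```
GET /articles-alpha/_search
{
  "query": {
    "match": {
      "title": "사이즈"
    }
  }
}
```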
jimczi commented 4 years ago

Thanks @drake-jin