elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ICU Normalizer adds spaces before certain non-combining dakuten and handakuten #27292

Open · Trey314159 opened this issue 6 years ago

Trey314159 commented 6 years ago

Elasticsearch version (bin/elasticsearch --version): 5.3.2

Plugins installed: [analysis-hebrew, analysis-icu, analysis-smartcn, analysis-stconvert, analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, ltr-query]

JVM version (java -version): java version "1.7.0_151" OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1) OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)

OS version (uname -a if on a Unix-like system): Linux mediawiki-vagrant 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

The ICU normalizer converts the non-combining forms of the Japanese diacritics dakuten (゛ U+309B) and handakuten (゜ U+309C) into a space followed by the corresponding combining form (U+3099 / U+309A).

For example, the two characters ヘ゜ (U+30D8 U+309C) are normalized to the three characters ヘ ゚ (U+30D8 U+0020 U+309A). (Depending on your browser, some of these characters may not be visible, and GitHub may also do odd things to them, so the Unicode code points are provided.)
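This space insertion comes from the Unicode compatibility decomposition of U+309B/U+309C, whose `<compat>` mappings are `0020 3099` and `0020 309A` respectively; the `icu_normalizer` filter defaults to NFKC_CF, which applies those decompositions. A minimal sketch of the same behavior using Python's standard-library normalizer (outside Elasticsearch, so only illustrative of the Unicode mapping, not of the plugin itself):

```python
import unicodedata

# KATAKANA LETTER HE followed by the non-combining handakuten
text = "\u30D8\u309C"  # ヘ゜

# NFKC decomposes U+309C into SPACE (U+0020) + U+309A (combining handakuten),
# and the space does not recompose with anything.
normalized = unicodedata.normalize("NFKC", text)

assert normalized == "\u30D8\u0020\u309A"
assert len(normalized) == 3
```
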

This interacts badly with the standard tokenizer, the ICU tokenizer, and the cjk_bigram filter.
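One way the bad interaction can be sketched (a hypothetical illustration, not the plugin's actual tokenizer code): if the normalization happens before tokenization, any whitespace-splitting tokenizer will turn the normalized output into a base letter plus an orphaned combining mark, neither of which matches the original bigram:

```python
import unicodedata

normalized = unicodedata.normalize("NFKC", "\u30D8\u309C")  # "ヘ ゚"

# A plain whitespace split stands in for a whitespace-sensitive tokenizer:
# the inserted U+0020 separates the base letter from its diacritic.
tokens = normalized.split(" ")

assert tokens == ["\u30D8", "\u309A"]  # orphaned combining handakuten
```

When the normalizer instead runs as a token filter after tokenization (as in the reproduction below), the space ends up *inside* a single token, which is equally surprising.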

Better results would be any of the following:

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query, etc. The easier you make it for us to reproduce it, the more likely that somebody will take the time to look at it.

  1. Set up analyzer text as { "type": "custom", "filter": ["icu_normalizer"], "tokenizer": "standard" }
  2. curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "ヘ゜" }'

For reference, the two characters being analyzed above are U+30D8 U+309C, and the three characters in the token below are U+30D8 U+0020 U+309A.

{
  "tokens" : [
    {
      "token" : "ヘ ゚",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<KATAKANA>",
      "position" : 0
    }
  ]
}
DaveCTurner commented 6 years ago

Related to #27291 and #27290. Thanks for the detailed reports @Trey314159!

mayya-sharipova commented 6 years ago

cc @elastic/es-search-aggs

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)