JVM version (java -version): java version "1.7.0_151"
OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1)
OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)
OS version (uname -a if on a Unix-like system): Linux mediawiki-vagrant 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
The ICU normalizer converts the non-combining forms of the Japanese diacritics dakuten (゛ U+309B) and handakuten (゜ U+309C) into a space followed by the corresponding combining form.
For example, the two characters ヘ゜ (U+30D8 U+309C) are normalized to the three characters ヘ ゚ (U+30D8 U+0020 U+309A). (Depending on your browser, some of these characters may not be visible, and GitHub may also render them oddly, so Unicode code points are provided.)
This interacts badly with the standard tokenizer, ICU tokenizer, and cjk_bigram filter.
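This expansion is inherited from plain Unicode NFKC behavior: U+309C carries the compatibility decomposition &lt;compat&gt; 0020 309A, so any NFKC-based normalizer (the icu_normalizer defaults to an NFKC-based form) produces the space. A minimal sketch using Python's standard library, independent of Elasticsearch:

```python
import unicodedata

# Non-combining handakuten U+309C has the compatibility decomposition
# <compat> U+0020 U+309A, so NFKC expands it to space + combining handakuten.
src = "\u30D8\u309C"  # ヘ゜ (KATAKANA LETTER HE + non-combining handakuten)
out = unicodedata.normalize("NFKC", src)

print([f"U+{ord(c):04X}" for c in out])  # ['U+30D8', 'U+0020', 'U+309A']
```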
Better results would be any of the following:
U+30D8 U+309A, i.e., ペ (the base character plus the combining handakuten)
U+30DA, i.e., ペ, the composed (NFC) form of U+30D8 U+309A above
U+30D8 U+309C, i.e., ヘ゜, the original input
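The first two suggested results are equivalent under normalization: the base character plus the combining mark composes cleanly under NFC. Again a stdlib sketch, not the plugin itself:

```python
import unicodedata

# KATAKANA LETTER HE + combining handakuten composes to the precomposed PE.
decomposed = "\u30D8\u309A"  # ヘ + combining ゚
composed = unicodedata.normalize("NFC", decomposed)

print(f"U+{ord(composed):04X}")  # U+30DA (ペ)
```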
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it
for us to reproduce, the more likely somebody will take the time to look at it.
Set up an analyzer named "text" as:
{ "type": "custom", "filter": ["icu_normalizer"], "tokenizer": "standard" }
Elasticsearch version (bin/elasticsearch --version): 5.3.2
Plugins installed: [analysis-hebrew, analysis-icu, analysis-smartcn, analysis-stconvert, analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, ltr-query]
curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "ヘ゜" }'
For reference, the two characters being analyzed above are U+30D8 U+309C, and the three characters in the token below are U+30D8 U+0020 U+309A.
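The injected U+0020 is why the tokenizers above mishandle the result: any tokenizer that breaks on whitespace separates the base character from its now-orphaned combining mark. A minimal illustration with a plain whitespace split (the standard tokenizer's behavior here is analogous, not identical):

```python
# The normalized output contains a literal space, so a whitespace break
# strands the combining handakuten in a token of its own.
normalized = "\u30D8\u0020\u309A"
tokens = normalized.split(" ")

print([f"U+{ord(c):04X}" for t in tokens for c in t])  # ['U+30D8', 'U+309A']
```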