Closed AnSungHyun closed 3 years ago
Pinging @elastic/es-search
Thanks for reporting this problem @AnSungHyun .
This is similar to https://github.com/elastic/elasticsearch/pull/34331 except that it occurs in a Tokenizer. The synonym filter checks that the input synonyms can be analyzed into a single form and fails to build if not. Since the mixed mode of the Korean tokenizer preserves both the compound and the split form, it is currently not possible to add a compound word to a synonym dictionary. I discussed this with @romseygeek offline and we think it is possible to add the same workaround as #34331 for tokenizers. This would allow us to change the tokenizer option when we build the synonym map. In this case we'd change the mixed mode to discard (which removes the compound) in order to make it compatible with synonym building.
I forgot that the output should also contain the compound and the decompounded forms of the expanded synonyms. Unfortunately this is not possible in the synonym filter, so the proposed solution above wouldn't work. Another possibility is to extract the de-compounding into a separate token filter instead of doing it in the tokenizer. This way it would be possible to set the synonym filter before the decompounding filter, and the tokenizer would always output a single path.
@AnSungHyun I am not sure this is the right way to solve this issue, but I believe it can be a workaround for you. In my case, I registered "대한민국,한국,코리아" as synonyms and ran into the same issue as you. "대한민국" is a compound word, so it produces exactly the same error. However, after I added "대한민국" to the user dictionary the error went away.
Here are my settings:

```
PUT test
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict",
            "filter": ["synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          }
        }
      }
    }
  },
  ...
}
```
And then added "대한민국" to userdict_ko.txt.
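For reference, the file contents in this workaround would look something like the following, based on the synonyms mentioned above (paths and formats are as I understand them; the synonyms file uses the Solr synonym format, and a plain line in the nori user dictionary registers the word as a single token):

analysis/synonyms.txt:

```
대한민국,한국,코리아
```

userdict_ko.txt:

```
대한민국
```

Because the compound is now a single dictionary entry, the tokenizer emits one path for it even in mixed mode, so the synonym filter can build its map.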
I hope this is helpful for you.
I am closing this issue as won't fix for now. Using the mixed mode of the nori tokenizer doesn't work with multi-word synonyms, but this is a broader problem. The solution for now is to use the discard mode in order to ensure that a single path is produced.
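Concretely, that means changing the tokenizer definition from the settings shown earlier in this thread to use discard, which drops the original compound and keeps only its parts:

```
"tokenizer": {
  "nori_user_dict": {
    "type": "nori_tokenizer",
    "decompound_mode": "discard",
    "user_dictionary": "userdict_ko.txt"
  }
}
```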
Error When Index Setting "Synonym Filter" with "Korean (nori) Analysis"
Elasticsearch version (`bin/elasticsearch --version`): 6.5.3
Plugins installed: [ analysis-nori ]
JVM version (`java -version`): java version "1.8.0_121"
OS version (`uname -a` if on a Unix-like system): Linux search 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Steps to reproduce:
1. Install the Korean (nori) analysis plugin: bin/elasticsearch-plugin install analysis-nori
2. Index setting. Index Create:
3. Error message
4. I tried the synonym_graph filter, but the issue was not resolved. Index Create:
5. Analyzed token result after removing the synonym filter. Index Create:
Try Analyze:
Result:
"풋사과" is a compound word. Is it not possible to use synonyms with compound words?
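To see the two paths that mixed mode produces for the compound, the tokenizer can be run in isolation against the index (request sketch only; the exact tokens emitted depend on the dictionary in use):

```
GET test/_analyze
{
  "tokenizer": "nori_user_dict",
  "text": "풋사과"
}
```

In mixed mode the response contains both the original compound and its decompounded parts at the same position, which is the multi-path output the synonym filter cannot consume.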