elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Korean (nori) tokenizer punctuation #80521

Open · AyWa opened this issue 2 years ago

AyWa commented 2 years ago

Elasticsearch version: 7.13.3 (tested on 7.15.1 too)

Plugins installed: [repository-s3, analysis-nori]

JVM version (java -version): Eclipse Adoptium/OpenJDK 64-Bit Server VM/17/17+35

OS version (uname -a if on a Unix-like system): Linux/4.9.184-linuxkit/amd64

Description of the problem including expected versus actual behavior:

The nori tokenizer yields different tokens depending on surrounding punctuation, even when discard_punctuation is true. Changing decompound_mode also has no impact on the tokens yielded. I believe this is a bug: if discard_punctuation is true, the output should be the same whether or not the punctuation is present.

{
  "tokenizer": "nori_tokenizer",
  "text":      "표현입니당)"
}

yields [표현, 입, 니, 당], while

{
  "tokenizer": "nori_tokenizer",
  "text":      "표현입니당"
}

yields [표현, 입, 니당].
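
For reference, the same behavior reproduces when both options are set explicitly. Below is a sketch of an equivalent _analyze request body using the documented nori_tokenizer parameters discard_punctuation and decompound_mode, both shown here at their default values:

{
  "tokenizer": {
    "type": "nori_tokenizer",
    "decompound_mode": "discard",
    "discard_punctuation": true
  },
  "text": "표현입니당)"
}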

Steps to reproduce:
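
A minimal reproduction, assuming a local node listening on localhost:9200 with the analysis-nori plugin installed (the request bodies are the two shown above):

# Analyze the text with trailing punctuation
curl -s -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "tokenizer": "nori_tokenizer",
  "text": "표현입니당)"
}'
# observed tokens: 표현, 입, 니, 당

# Analyze the same text without the punctuation
curl -s -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "tokenizer": "nori_tokenizer",
  "text": "표현입니당"
}'
# observed tokens: 표현, 입, 니당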

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)