elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Korean (nori) tokenizer punctuation #80521

Open · AyWa opened this issue 2 years ago

AyWa commented 2 years ago

Elasticsearch version: 7.13.3 (tested on 7.15.1 too)

Plugins installed: [repository-s3, analysis-nori]

JVM version (java -version): Eclipse Adoptium/OpenJDK 64-Bit Server VM/17/17+35

OS version (uname -a if on a Unix-like system): Linux/4.9.184-linuxkit/amd64

Description of the problem including expected versus actual behavior:

The nori tokenizer yields different tokens depending on surrounding punctuation, even when discard_punctuation is true. Changing decompound_mode also has no impact on the tokens yielded. I believe this is a bug: if discard_punctuation is true, the output should be the same whether or not the punctuation is present.

{
  "tokenizer": "nori_tokenizer",
  "text":      "표현입니당)"
}

yields [표현, 입, 니, 당], while

{
  "tokenizer": "nori_tokenizer",
  "text":      "표현입니당"
}

yields [표현, 입, 니당].
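
For reference, the same behavior reproduces when both options are set explicitly. Below is a sketch of an equivalent _analyze request body using the documented nori_tokenizer parameters discard_punctuation and decompound_mode, both shown here at their default values:

{
  "tokenizer": {
    "type": "nori_tokenizer",
    "decompound_mode": "discard",
    "discard_punctuation": true
  },
  "text": "표현입니당)"
}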

Steps to reproduce:
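
A minimal reproduction, assuming a local node listening on localhost:9200 with the analysis-nori plugin installed (the request bodies are the two shown above):

# Analyze the text with trailing punctuation
curl -s -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "tokenizer": "nori_tokenizer",
  "text": "표현입니당)"
}'
# observed tokens: 표현, 입, 니, 당

# Analyze the same text without the punctuation
curl -s -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "tokenizer": "nori_tokenizer",
  "text": "표현입니당"
}'
# observed tokens: 표현, 입, 니당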

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)