elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Plugins analysis-stempel incorrect tokens generation for numbers #71483

Open domsew opened 3 years ago

domsew commented 3 years ago

Actual: The stemmer alters some numeric tokens, which causes wrong search results. For example, "2021" becomes "20ć".

Expected: Numeric strings should not be changed by the stemmer.

Reproduce: request:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["polish_stem"],
  "text": "2021"
}

response:

{
  "tokens": [
    {
      "token": "20ć",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
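For checking other inputs, here is a minimal Python sketch, not part of the original report, that flags `<NUM>` tokens whose text no longer matches the input. The helper name `changed_numeric_tokens` is hypothetical, and the response is hard-coded from the output above rather than fetched from a cluster.

```python
import json

# Response body copied from the _analyze call above (hard-coded, no cluster needed).
RESPONSE = json.loads("""
{
  "tokens": [
    {
      "token": "20ć",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
""")


def changed_numeric_tokens(text, response):
    """Return (original, emitted) pairs for <NUM> tokens that were altered.

    Hypothetical helper: it slices the input with the reported offsets,
    which is only safe for text without characters outside the BMP.
    """
    changed = []
    for tok in response["tokens"]:
        original = text[tok["start_offset"]:tok["end_offset"]]
        if tok["type"] == "<NUM>" and tok["token"] != original:
            changed.append((original, tok["token"]))
    return changed


print(changed_numeric_tokens("2021", RESPONSE))  # [('2021', '20ć')]
```

An empty list would mean every numeric token passed through the filter chain unchanged, which is the expected behaviour described above.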

Elasticsearch version: 7.11.2

Plugins installed: [analysis-stempel]

OS version: CentOS

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

jimczi commented 3 years ago

Sorry for the late reply, @domsew. The logic for this plugin is implemented in Lucene, so could you please open an issue with the Lucene project? We only use the Lucene module in ES, so we cannot change how it works internally.

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)