elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support Case Insensitive search for foreign characters in wildcard field type #95120

Open jypan0115 opened 1 year ago

jypan0115 commented 1 year ago

Description

In server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java, the toCaseInsensitiveChar function currently only handles ASCII characters. Why are non-ASCII characters, such as Vietnamese ones, not supported? This is inconsistent with the keyword field type.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

cbuescher commented 1 year ago

Comment from this closed issue:

If we store the value in a keyword field and use a case-insensitive term query to search for Ngô Đức (uppercase) or ngô đức (lowercase), it works fine. But if it is stored in a wildcard field, the same case-insensitive term query fails.

cbuescher commented 1 year ago

@jypan0115 as you pointed out in https://github.com/elastic/elasticsearch/pull/61596#issuecomment-1512808119 we don't apply the logic that handles case-insensitivity for codepoints outside the ASCII range. The reason for this currently isn't clear to me; we'd need to do some digging to find out why it was done before removing this limitation. But it makes sense to me that we should also support characters outside the ASCII range if possible.

jypan0115 commented 1 year ago

@cbuescher Kindly checking any updates here?

cbuescher commented 1 year ago

Just for reference, the choice to limit the case_insensitive option at the time of introduction was a deliberate one. While digging a bit further and trying to find the history around this decision, I found this discussion on a related Lucene PR that adds a case insensitivity flag to the RegExp automaton class there. At first glance, it seems there is no straightforward 1:1 mapping between lower and upper case letters for Unicode in general. While such an unambiguous mapping might exist for a lot of cases, this needs more careful investigation and discussion.
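
To make the 1:1 concern concrete, here is a small Java sketch that probes the JDK's own case mappings. It is not taken from Lucene or Elasticsearch; the class name UnicodeCaseMappingDemo and its helper methods are made up for illustration. It round-trips a few code points through upper case and back: ô survives, but final sigma (ς), long s (ſ) and dotless i (ı) come back as a different letter, and ß has no single-character upper case at all.

```java
import java.util.Locale;

// Illustration only (hypothetical class, not Elasticsearch or Lucene code):
// probes java.lang.Character and String case mappings for code points whose
// upper/lower case forms do not form a simple 1:1 pair.
final class UnicodeCaseMappingDemo {

    // Upper-cases a single code point, then lower-cases the result.
    // For a clean 1:1 pair, a lower case input comes back unchanged.
    static int upperThenLower(int codepoint) {
        return Character.toLowerCase(Character.toUpperCase(codepoint));
    }

    // Full (string-level) upper-casing, which may change the length.
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        int[] samples = {
            0x00F4, // o with circumflex: U+00D4 and back to U+00F4
            0x03C2, // Greek final sigma: U+03A3, then U+03C3 (plain sigma)
            0x017F, // Latin long s: U+0053 'S', then U+0073 's'
            0x0131, // dotless i: U+0049 'I', then U+0069 'i'
            0x00DF  // sharp s: no single-code-point upper case, stays U+00DF
        };
        for (int cp : samples) {
            System.out.printf("U+%04X upper=U+%04X roundTrip=U+%04X%n",
                cp, Character.toUpperCase(cp), upperThenLower(cp));
        }
        // String-level mapping expands sharp s to two characters: "SS"
        System.out.println(fullUpper("\u00DF"));
    }
}
```

Whether these edge cases matter for a given field depends on the data; the sketch only shows that a per-character "other case" lookup cannot cover every Unicode case relationship.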

jypan0115 commented 1 year ago

@cbuescher In the toCaseInsensitiveChar function in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java, if we remove the ASCII-only limit, if (codepoint > 128) { return case1; }, Unicode characters will be handled by this line: int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint); since Character.isLowerCase(), Character.toUpperCase(), and Character.toLowerCase() can all deal with Unicode.
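
A minimal, self-contained sketch of what this proposal would compute, assuming the ASCII guard is simply dropped. It reuses only the altCase expression quoted above; the class CaseVariantsSketch and the variants helper are hypothetical and stand in for the automaton that the real method builds with Lucene, which is left out here.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not the Elasticsearch implementation: shows which code
// points a per-character case-insensitive matcher would accept if the
// ASCII-only guard were removed.
final class CaseVariantsSketch {

    // Same expression as the altCase line quoted in the comment above.
    static int altCase(int codepoint) {
        return Character.isLowerCase(codepoint)
            ? Character.toUpperCase(codepoint)
            : Character.toLowerCase(codepoint);
    }

    // The original code point plus its alternate case, if it has one.
    static Set<Integer> variants(int codepoint) {
        Set<Integer> result = new LinkedHashSet<>();
        result.add(codepoint);
        result.add(altCase(codepoint)); // no-op when there is no other case
        return result;
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0110, // Latin capital D with stroke (Vietnamese)
            0x00F4, // Latin small o with circumflex
            0x03C3, // Greek small sigma
            0x4E2D  // a CJK ideograph, which has no case
        };
        for (int cp : samples) {
            StringBuilder line = new StringBuilder(String.format("U+%04X ->", cp));
            for (int v : variants(cp)) {
                line.append(String.format(" U+%04X", v));
            }
            System.out.println(line);
        }
    }
}
```

For the Vietnamese letters from the example query this yields the expected pairs (U+0110 with U+0111, U+00F4 with U+00D4). For Greek small sigma it yields only capital sigma, so final sigma (U+03C2) would still not match, which is the kind of ambiguity raised in the Lucene discussion.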

jypan0115 commented 1 year ago

@cbuescher @javanna Kindly checking any updates here?

nemphys commented 1 year ago

+1 on this one. I was trying to figure out why I was not getting the expected results when using the new case_insensitive setting (on Greek strings) until I stumbled upon this issue.

jypan0115 commented 1 year ago

@cbuescher @javanna Kindly checking any updates here?

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)