Open jypan0115 opened 1 year ago
Pinging @elastic/es-search (Team:Search)
Comment from this closed issue:
If we store it as a keyword field and use a case-insensitive term query to search for Ngô Đức (uppercase) or ngô đức (lowercase), it works fine. But if it is stored as a wildcard field, the case-insensitive term query fails.
@jypan0115 as you pointed out in https://github.com/elastic/elasticsearch/pull/61596#issuecomment-1512808119 we don't apply the logic that handles case-insensitivity for codepoints outside the ASCII range. The reason for this currently isn't clear to me, we'd need to do some digging here to check for the reason to do so before removing this limitation, but it makes sense to me that we should also support characters outside the ASCII range if possible.
@cbuescher Kindly checking any updates here?
Just for reference, the choice to limit the case_insensitive option at the time of introduction was a deliberate one. While digging a bit further and trying to find history around this decision, I found this discussion on a related Lucene PR that adds a case insensitivity flag to the RegExp automaton class there. At first look, it seems there is no straightforward 1:1 mapping between lower and upper case letters for Unicode in general. While for a lot of cases there might be such an unambiguous mapping, this needs more careful investigation and discussion.
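To make the "no 1:1 mapping" concern concrete, here is a minimal Java sketch. It is not Elasticsearch or Lucene code; the class name CaseRoundTrip and its methods are hypothetical and use only java.lang.Character. It swaps the case of a code point twice and checks whether the original comes back.

```java
// Sketch only: hypothetical helper, not part of Elasticsearch or Lucene.
final class CaseRoundTrip {
    // Swap case using the simple (single code point) mappings in java.lang.Character.
    static int altCase(int codepoint) {
        return Character.isLowerCase(codepoint)
            ? Character.toUpperCase(codepoint)
            : Character.toLowerCase(codepoint);
    }

    // True when swapping case twice returns the original code point.
    static boolean roundTrips(int codepoint) {
        return altCase(altCase(codepoint)) == codepoint;
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0111, // Vietnamese d with stroke (lowercase)
            0x03B1, // Greek small alpha
            0x03C2, // Greek small final sigma
            0x0131  // Latin small dotless i
        };
        for (int cp : samples) {
            System.out.printf("U+%04X round-trips: %b%n", cp, roundTrips(cp));
        }
    }
}
```

Under these mappings the Vietnamese and Greek alpha samples round-trip, while final sigma (its uppercase lowercases to the regular small sigma) and dotless i (its uppercase lowercases to ASCII i) do not, which is the kind of ambiguity that would need a decision.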
@cbuescher In the toCaseInsensitiveChar function in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java, if we remove the ASCII limit if (codepoint > 128) { return case1; }, Unicode characters will be taken care of by this line: int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint); since Character.isLowerCase(), Character.toUpperCase(), and Character.toLowerCase() can all deal with Unicode.
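As a rough illustration of the suggestion above, the following standalone Java sketch applies the quoted alt-case expression to every code point of a string with no ASCII guard. The class SwapCaseSketch and the method swapCase are hypothetical names for this example; the real toCaseInsensitiveChar builds an automaton, which is not reproduced here.

```java
// Sketch only: hypothetical helper, not the AutomatonQueries implementation.
final class SwapCaseSketch {
    // Replace each code point with its other-case form, for ASCII and non-ASCII alike.
    static String swapCase(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(codepoint -> {
            // Same expression as the line quoted in the comment above.
            int altCase = Character.isLowerCase(codepoint)
                ? Character.toUpperCase(codepoint)
                : Character.toLowerCase(codepoint);
            out.appendCodePoint(altCase);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "ngô đức" written with escapes to avoid source-encoding issues.
        String lower = "ng\u00F4 \u0111\u1EE9c";
        System.out.println(swapCase(lower));
    }
}
```

For precomposed Vietnamese letters like these, each code point has a single other-case partner, so the expression yields the expected uppercase form. It does not address characters whose case mapping is not a simple pair.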
@cbuescher @javanna Kindly checking any updates here?
+1 on this one, I was trying to figure out why I was not getting the expected results when using the new case_insensitive setting (on greek strings) until I stumbled upon this issue.
@cbuescher @javanna Kindly checking any updates here?
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Description
In server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java, the toCaseInsensitiveChar function currently only works with ASCII characters. May I know why it does not support non-ASCII characters such as Vietnamese? This is not consistent with the behavior of the keyword field.