Open asfimport opened 5 years ago
Alan Woodward (@romseygeek) (migrated from JIRA)
Hi @tonicava, thanks for opening the issue. Would you be able to provide an example or test case that illustrates the failure?
Octavian Mocanu (@tonicava) (migrated from JIRA)
Hi @romseygeek,
Trying e.g. at
/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:83
in
private static final CharSequence escapeTerm(CharSequence term, Locale locale)
with
term = "İpone " [304, 112, 111, 110, 101, 32]
locale = "us"
result -> StringIndexOutOfBoundsException
(it'll only work when having locale = "tr")
Alan Woodward (@romseygeek) (migrated from JIRA)
It looks as though FieldQueryNode and PathQueryNode are using Locale.getDefault()
when they should be using Locale.ROOT
cc [\~thetaphi] who understands locales better than I do...
Namgyu Kim (@danmuzi) (migrated from JIRA)
Hi, @romseygeek, @uschindler.
I checked the issue and found that it could be a logical problem.
First, I think it's not a Locale problem, but a replace algorithm(replaceIgnoreCase) itself.
When you see the escapeWhiteChar(), it calls the replaceIgnoreCase() internally. (escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase())
private static CharSequence replaceIgnoreCase(CharSequence string,
CharSequence sequence1, CharSequence escapeChar, Locale locale) {
// string = "İpone " [304, 112, 111, 110, 101, 32], size = 6
...
while (start < count) {
// Convert by toLowerCase as follows.
// string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
// firstIndex will be set 6.
if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
start)) == -1)
break;
boolean found = true;
...
if (found) {
// In this line, String.toString() will only have a range of 0 to 5.
// So here we get a StringIndexOutOfBoundsException.
result.append(string.toString().substring(copyStart, firstIndex));
...
} else {
start = firstIndex + 1;
}
}
...
}
Solving this may not be a big problem.
But what do you think about using
public static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
...
for (int i = 0; i < escapableWhiteChars.length; i++) {
// Use String's replace method.
buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
}
return buffer;
}
instead of
public static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
...
for (int i = 0; i < escapableWhiteChars.length; i++) {
// Stay current method.
buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
}
return buffer;
}
in the escapeWhiteChar method?
Octavian Mocanu (@tonicava) (migrated from JIRA)
Hi, in my opinion, the proposal of @danmuzi would be a nice solution, mainly for it avoids the problematic use of toLowerCase.
Best!
Chongchen Chen (migrated from JIRA)
I find the code is relative to #5271 . Maybe we should implement it like:
public static final CharSequence escapeWhiteChar(CharSequence str,
Locale locale) {
...
for (int i = 0; i < escapableWhiteChars.length; i++) {
buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
buffer = buffer.toString().replace(escapableWhiteChars[i].toLowerCase(locale), "\\");
}
return buffer;
}
With "lucene-queryparser-6.3.0", specifically in "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java" when escaping strings containing extended unicode chars, and with a locale distinct from that of the character set the string uses, the process fails, with a "java.lang.StringIndexOutOfBoundsException". The reason is that the comparison is done by previously converting all of the characters of the string to lower case chars, and by doing this, the original string size isn't anymore the same, but less, as of the transformed one, so that executing org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:89
fails with a java.lang.StringIndexOutOfBoundsException.
I wonder whether the transformation to lower case is really needed when treating the escape chars, since by avoiding it, the error may be avoided.
Migrated from LUCENE-8572 by Octavian Mocanu (@tonicava), updated Aug 10 2019