StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java [LUCENE-8572]

asfimport commented 5 years ago

With "lucene-queryparser-6.3.0", specifically in "org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java" when escaping strings containing extended unicode chars, and with a locale distinct from that of the character set the string uses, the process fails, with a "java.lang.StringIndexOutOfBoundsException". The reason is that the comparison is done by previously converting all of the characters of the string to lower case chars, and by doing this, the original string size isn't anymore the same, but less, as of the transformed one, so that executing org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:89

fails with a java.lang.StringIndexOutOfBoundsException.

I wonder whether the transformation to lower case is really needed when treating the escape chars, since by avoiding it, the error may be avoided.

Migrated from LUCENE-8572 by Octavian Mocanu (@tonicava), updated Aug 10 2019

asfimport commented 5 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Hi @tonicava, thanks for opening the issue. Would you be able to provide an example or test case that illustrates the failure?

asfimport commented 5 years ago

Octavian Mocanu (@tonicava) (migrated from JIRA)

Hi @romseygeek,

Trying e.g. at

/org/apache/lucene/queryparser/flexible/standard/parser/EscapeQuerySyntaxImpl.java:83

in

private static final CharSequence escapeTerm(CharSequence term, Locale locale)

with

term = "İpone " [304, 112, 111, 110, 101, 32]
locale = "us"

result -> StringIndexOutOfBoundsException

(it'll only work when having locale = "tr")

asfimport commented 5 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

It looks as though FieldQueryNode and PathQueryNode are using Locale.getDefault() when they should be using Locale.ROOT

cc [\~thetaphi‍] who understands locales better than I do...

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

Hi, @romseygeek, @uschindler.

I checked the issue and found that it could be a logical problem.

First, I think it's not a Locale problem, but a replace algorithm(replaceIgnoreCase) itself.

When you see the escapeWhiteChar(), it calls the replaceIgnoreCase() internally. (escapeTerm() -> escapeWhiteChar() -> replaceIgnoreCase())

private static CharSequence replaceIgnoreCase(CharSequence string,
    CharSequence sequence1, CharSequence escapeChar, Locale locale) {
  // string = "İpone " [304, 112, 111, 110, 101, 32],  size = 6
  ...
  while (start < count) {
    // Convert by toLowerCase as follows.
    // string = "i'̇pone " [105, 775, 112, 111, 110, 101, 32], size = 7
    // firstIndex will be set 6.
    if ((firstIndex = string.toString().toLowerCase(locale).indexOf(first,
        start)) == -1)
      break;
    boolean found = true;
    ...
    if (found) {
      // In this line, String.toString() will only have a range of 0 to 5.
      // So here we get a StringIndexOutOfBoundsException.
      result.append(string.toString().substring(copyStart, firstIndex));
      ...
    } else {
      start = firstIndex + 1;
    }
  }
  ...
}

Solving this may not be a big problem.

But what do you think about using

public static final CharSequence escapeWhiteChar(CharSequence str,
      Locale locale) {
    ...

    for (int i = 0; i < escapableWhiteChars.length; i++) {
      // Use String's replace method.
      buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
    }
    return buffer;
  }

instead of

public static final CharSequence escapeWhiteChar(CharSequence str,
      Locale locale) {
    ...

    for (int i = 0; i < escapableWhiteChars.length; i++) {
      // Stay current method.
      buffer = replaceIgnoreCase(buffer, escapableWhiteChars[i].toLowerCase(locale), "\\", locale);
    }
    return buffer;
  }

in the escapeWhiteChar method?

asfimport commented 5 years ago

Octavian Mocanu (@tonicava) (migrated from JIRA)

Hi, in my opinion, the proposal of @danmuzi would be a nice solution, mainly for it avoids the problematic use of toLowerCase.

Best!

asfimport commented 5 years ago

Chongchen Chen (migrated from JIRA)

I find the code is relative to #5271 . Maybe we should implement it like:

public static final CharSequence escapeWhiteChar(CharSequence str,
      Locale locale) {
    ...

    for (int i = 0; i < escapableWhiteChars.length; i++) {
      buffer = buffer.toString().replace(escapableWhiteChars[i], "\\");
      buffer = buffer.toString().replace(escapableWhiteChars[i].toLowerCase(locale), "\\");
    }
    return buffer;
  }

apache / lucene

StringIndexOutOfBoundsException in parser/EscapeQuerySyntaxImpl.java [LUCENE-8572] #9618