apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Query parser fails to parse a range query string when there are escaped brackets inside the range #13234

Open marko-bekhta opened 7 months ago

marko-bekhta commented 7 months ago

Description

Assume there's a query parser created, e.g.:

Analyzer analyzer = new ClassicAnalyzer();
QueryParser queryParser = new QueryParser( "field", analyzer );

Parsing a simple range such as:

// simple range query with no escapes works fine:
Query query = queryParser.parse( "[ 1 TO 10 ]" );

works as expected: a TermRangeQuery is created and no exception is thrown.

However, if the range in the string contains escaped brackets, parsing fails with an exception. Suppose one wants to extend the query parser to work with date-time fields and parse something like:

// another range query but now it has some escaping between the range brackets:
query = queryParser.parse( "[ 2024\\-01\\-01T01\\:01\\:01\\+01\\:00\\[Europe\\/Warsaw\\] TO 2025\\-01\\-01T01\\:01\\:01\\+01\\:00\\[Europe\\/Warsaw\\] ]" );

Even though all special characters are escaped, this leads to:

Caused by: org.apache.lucene.queryparser.classic.ParseException: Encountered " "]" "] "" at line 1, column 50.
Was expecting:
    "TO" ...

    at org.apache.lucene.queryparser.classic.QueryParser.generateParseException(QueryParser.java:1004)
    at org.apache.lucene.queryparser.classic.QueryParser.jj_consume_token(QueryParser.java:867)
    at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:532)
    at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:366)
    at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:251)
    at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:223)
    at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:137)
    ... 4 more

Note that the plan for supporting date range queries is to override newRangeQuery along these lines:

QueryParser queryParser = new QueryParser( "field", analyzer ) {
    @Override
    protected Query newRangeQuery(String field, String part1, String part2, boolean startInclusive, boolean endInclusive) {
        var p1 = parseValue(part1);
        var p2 = parseValue(part2);

        return createRangeQueryForDates( field, p1, p2, startInclusive, endInclusive );
    }
};

but because of the parsing error described above, execution never reaches this point.
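For illustration only, the parseValue helper referenced above could look roughly like the sketch below. It is hypothetical and not part of Lucene: the class name DateRangeBounds, the choice of Instant as the return type, and the treatment of a null bound as open-ended are all assumptions. It also assumes the bound reaches newRangeQuery unescaped and in its original case.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical stand-in for the parseValue helper from the snippet above; not a Lucene API.
final class DateRangeBounds {

    // Parses an ISO-8601 zoned date-time such as
    // 2024-01-01T01:01:01+01:00[Europe/Warsaw] into an Instant.
    static Instant parseValue(String part) {
        if (part == null) {
            // Assumption: a null bound means the range is open on that side.
            return null;
        }
        try {
            return ZonedDateTime.parse(part).toInstant();
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("Not a zoned date-time: " + part, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseValue("2024-01-01T01:01:01+01:00[Europe/Warsaw]"));
    }
}
```

Converting both bounds to Instant would let a createRangeQueryForDates implementation compare them on a single timeline regardless of the zone each bound was written in.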

Version and environment details

Java version: 17.0.9, vendor: Amazon.com Inc.
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "6.7.10-200.fc39.x86_64", arch: "amd64", family: "unix"

Lucene 9.10.0

benchaplin commented 6 months ago

You can get around this by placing each range term in quotes:

query = queryParser.parse( "[ \"2024\\-01\\-01T01\\:01\\:01\\+01\\:00\\[Europe\\/Warsaw\\]\" TO \"2025\\-01\\-01T01\\:01\\:01\\+01\\:00\\[Europe\\/Warsaw\\]\" ]" );

In fact, once the terms are quoted, you don't need to escape anything other than the quotes themselves:

query = queryParser.parse( "[ \"2024-01-01T01:01:01+01:00[Europe/Warsaw]\" TO \"2025-01-01T01:01:01+01:00[Europe/Warsaw]\" ]" );

Both will be parsed to [2024-01-01t01:01:01+01:00[europe/warsaw] TO 2025-01-01t01:01:01+01:00[europe/warsaw]].

(I've added some tests showing this: https://github.com/apache/lucene/pull/13323)
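If the range bounds come from user input, the quoting workaround can be wrapped in a small helper. The following is a minimal sketch using plain string handling only. The class and method names (RangeQueryStrings, quoteBound, inclusiveRange) are made up for illustration. It assumes that inside a quoted term only the backslash and the double quote need escaping.

```java
// Hypothetical helper for building quoted range query strings; not a Lucene API.
final class RangeQueryStrings {

    // Wraps a raw bound in double quotes, escaping backslashes first and
    // then double quotes (assumed to be the only characters needing it).
    static String quoteBound(String raw) {
        String escaped = raw.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Builds an inclusive range of the form [ "lower" TO "upper" ].
    static String inclusiveRange(String lower, String upper) {
        return "[ " + quoteBound(lower) + " TO " + quoteBound(upper) + " ]";
    }

    public static void main(String[] args) {
        System.out.println(inclusiveRange(
                "2024-01-01T01:01:01+01:00[Europe/Warsaw]",
                "2025-01-01T01:01:01+01:00[Europe/Warsaw]"));
    }
}
```

The resulting string can be passed to queryParser.parse( ... ) in place of the hand-written literal from the second example above.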

marko-bekhta commented 6 months ago

Thanks for looking into this and for the suggestion! I tested it and can confirm that it works. I'll let you decide how to proceed with this ticket (judging by the linked PR, you are considering whether the parser should be updated to support more query string variations).