apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.39k stars 950 forks source link

Add test for parsing brackets in range queries #13323

Open benchaplin opened 3 weeks ago

benchaplin commented 3 weeks ago

Description

I've added some unit tests to protect/display the QueryParser's ability to parse brackets in range query terms when in quotes.

This issue raised a question about the QueryParser's ability to handle brackets in a range query's terms, even when escaped, for example:

queryParser.parse( "[ a\\[i\\] TO b\\[i\\] ]" );

/* 
org.apache.lucene.queryparser.classic.ParseException: Cannot parse '[a\[i\] TO b\[i\]]': Encountered " "]" "] "" at line 1, column 6.
Was expecting:
    "TO" ...
 */

You can get around this exception by wrapping each range query term in quotes:

queryParser.parse( "[\"a[i]\" TO \"b[i]\"]" );

This is because of this JavaCC grammar rule. <RANGE_GOOP> cannot handle a closing bracket, even an escaped one, while <RANGE_QUOTED> can.

I've added some unit tests to protect this functionality.


Any thoughts on whether queryParser.parse( "[ a\\[i\\] TO b\\[i\\] ]" ); should work? I did play around with updating the grammar, it should be possible but I need to think more about how it might break backwards compatibility.

dweiss commented 3 weeks ago

Any thoughts on whether queryParser.parse( "[ a\[i\] TO b\[i\] ]" ); should work? I did play around with updating the grammar, it should be possible but I need to think more about how it might break backwards compatibility.

Looking at the javacc file it seems range terms are parsed differently compared to normal tokens. This is strange but I bet there was a reason for it - now long forgotten. I think, logically, it should accept the same term notation that is used elsewhere. It will break backward compatibility but maybe consistency is worth this cost (for the major release)?

benchaplin commented 3 weeks ago

I was thinking a fix like this could work:

| <RANGE_GOOP:   (~[ " ", "]", "}" ] | "\\ " | "\\]" | "\\}" )+ >

simply allowing the range term to continue parsing through an escaped space or closing bracket.

"Breaking" changes:

dweiss commented 3 weeks ago

The joys of escaping... Never easy (hello, Windows command-line users...). You may want to add a randomized test that constructs those terms using a mix of characters and "allowed" escape sequences - a corner case will pop up quickly, I think.

github-actions[bot] commented 6 days ago

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!