Flexible "standard" query parser parses on whitespace [LUCENE-7315]

asfimport commented 8 years ago

Copied from #3679:

The query parser splits input on whitespace and sends each whitespace-separated term to its own independent token stream. This breaks the following at query time, because they can't see across whitespace boundaries:

- n-gram analysis
- shingles
- synonyms (especially multi-word for whitespace-separated languages)
- languages where a 'word' can contain whitespace (e.g. Vietnamese)

It's also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index time and query time, but in many cases they can't. Preferably, the query parser would instead split only around real 'operators'.
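To make the failure mode concrete, here is a minimal toy sketch in plain Java. It does not use the Lucene API: `bigramShingles` is a hypothetical stand-in for an analysis chain (whitespace tokenizer plus a bigram shingle filter), and `analyzePerChunk` is a hypothetical stand-in for a parser that pre-splits on whitespace before analysis.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplitDemo {

    // Toy stand-in for an analysis chain (assumption, not Lucene code):
    // tokenize on whitespace, then emit every pair of adjacent tokens.
    static List<String> bigramShingles(String text) {
        List<String> shingles = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return shingles;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            shingles.add(tokens[i] + " " + tokens[i + 1]);
        }
        return shingles;
    }

    // Mimics a parser that splits on whitespace first and then runs the
    // analysis chain on each chunk independently.
    static List<String> analyzePerChunk(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(bigramShingles(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "wi fi router";
        // Whole text analyzed at once, as at index time: shingles are produced.
        System.out.println(bigramShingles(query));   // [wi fi, fi router]
        // Pre-split on whitespace: each chunk is a single token, so no shingles.
        System.out.println(analyzePerChunk(query));  // []
    }
}
```

The same reasoning applies to multi-word synonyms: a filter that only ever sees one whitespace-separated chunk at a time has nothing to match across.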

Migrated from LUCENE-7315 by Steven Rowe (@sarowe), 2 votes, updated Jul 20 2016

Attachments: LUCENE-7315.patch

Linked issues: #3679

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

How does this issue differ from #3679?

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I see: this issue is about applying the same fixes made in #3679, which was for the classic query parser, to the flexible query parser.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Yes.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

WIP patch against master. Generated files are not included (ant javacc-flexible in lucene/queryparser/ will generate them), and the patch still has nocommits and failing tests.

In addition to making it possible to not split on whitespace prior to text analysis, the patch includes the following changes:

- Changed TermQueryNode's positionIncrement name to position, since that's what it really holds.
- SynonymQueryNode/Builder now produces a SynonymQuery instead of a boolean query.
- Refactored AnalyzerQueryNodeProcessor.postProcessNode() into shorter methods and made it simpler and easier to follow.
- Moved split-on-whitespace tests to the shared QueryParserTestBase.

Some challenges remain:

- Unlike the classic QP, the flexible standard QP appears to remove a top-level MUST boolean query, e.g. +(word) -> word. Some of the split-on-whitespace shared tests will need to be specialized for each parser.
- When not splitting on whitespace, text containing whitespace produces a boolean query, and there's no simple way to collapse that query's children into their ancestor boolean query (if there is one), so some of the shared split-on-whitespace tests are failing.
  - The patch includes a FlattenQueryNodeProcessor meant to address this issue, but it's not working and I haven't figured out why yet.
- Recent master-only changes will likely make the branch_6x backport non-trivial, e.g. #8401.
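For readers unfamiliar with the collapsing problem mentioned above, the following is a toy sketch of what splicing a nested boolean group into its ancestor means. It is not the patch's FlattenQueryNodeProcessor and does not use the flexible parser's QueryNode classes; `Node` and `flatten` are hypothetical names, and the sketch ignores modifiers such as MUST and MUST_NOT, which a real implementation would have to respect before merging groups.

```java
import java.util.ArrayList;
import java.util.List;

final class FlattenDemo {

    // Hypothetical toy node: a leaf term, or a boolean group when term == null.
    static final class Node {
        final String term;
        final List<Node> children = new ArrayList<>();

        Node(String term) {
            this.term = term;
        }

        static Node bool(Node... kids) {
            Node group = new Node(null);
            for (Node kid : kids) {
                group.children.add(kid);
            }
            return group;
        }

        boolean isBool() {
            return term == null;
        }

        @Override
        public String toString() {
            if (!isBool()) {
                return term;
            }
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(children.get(i));
            }
            return sb.append(')').toString();
        }
    }

    // Bottom-up: flatten each child first, then splice the children of any
    // nested boolean group directly into the enclosing group.
    static Node flatten(Node node) {
        if (!node.isBool()) {
            return node;
        }
        Node flat = new Node(null);
        for (Node child : node.children) {
            Node flatChild = flatten(child);
            if (flatChild.isBool()) {
                flat.children.addAll(flatChild.children);
            } else {
                flat.children.add(flatChild);
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        // "a" plus a nested group, as analysis of whitespace-containing text might produce.
        Node query = Node.bool(new Node("a"), Node.bool(new Node("b"), new Node("c")));
        System.out.println(query);           // (a (b c))
        System.out.println(flatten(query));  // (a b c)
    }
}
```

Unconditional splicing like this is only safe when the nested group and its parent have compatible semantics, which may be part of why the real processor is harder to get right.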


Flexible "standard" query parser parses on whitespace [LUCENE-7315] #8369
