apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Flexible "standard" query parser parses on whitespace [LUCENE-7315] #8369

Open asfimport opened 8 years ago

asfimport commented 8 years ago

Copied from #3679:

The query parser splits its input on whitespace and sends each whitespace-separated term to its own independent token stream. This breaks the following at query time, because they can't see across whitespace boundaries:

- n-gram analysis
- shingles
- synonyms (especially multi-word synonyms, for whitespace-separated languages)
- languages where a 'word' can contain whitespace (e.g. Vietnamese)

It's also rather unexpected: users assume their charfilters/tokenizers/tokenfilters will do the same thing at index time and query time, but in many cases they can't. Preferably, the query parser would instead parse only around real 'operators'.
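To make the problem concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: `analyze` is a hypothetical stand-in for an analysis chain that emits unigrams plus two-word shingles, and `splitThenAnalyze` imitates a parser that splits on whitespace before analysis. Both names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitDemo {

    // Toy stand-in for an analysis chain (NOT a Lucene class): lowercases,
    // tokenizes on whitespace, then emits each word plus two-word shingles.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    // Imitates the behavior described in this issue: the parser splits on
    // whitespace first, so each chunk is analyzed in isolation.
    static List<String> splitThenAnalyze(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "new york city";
        // Whole text analyzed at once: shingles are produced.
        System.out.println(analyze(query));
        // Split on whitespace first: no shingle can ever be formed.
        System.out.println(splitThenAnalyze(query));
    }
}
```

In this sketch the index-time path (analyzing the whole text) yields shingles such as "new york", while the query-time path never can, because the shingle step only ever sees one word at a time. The same reasoning applies to multi-word synonyms and n-grams that span whitespace.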


Migrated from LUCENE-7315 by Steven Rowe (@sarowe), 2 votes, updated Jul 20 2016
Attachments: LUCENE-7315.patch
Linked issues:

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

How does this issue differ from #3679?

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I see: this issue is about applying the same fixes as in #3679, which was for the classic query parser, to the flexible query parser.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Yes.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

WIP patch against master. Generated files are not included (running ant javacc-flexible in lucene/queryparser/ will generate them), and the patch still has nocommits and failing tests.

In addition to enabling not splitting on whitespace prior to text analysis, the patch includes the following changes:

Some challenges remain: