
QueryParser with wildcard search does not use Analyzer's tokenizer [LUCENE-7437] #8489

Open asfimport opened 8 years ago

asfimport commented 8 years ago

A tokenizer that splits at underscores (e.g. SimpleAnalyzer) splits "qwert_asdfghjkl" into two words at indexing time.

Searches for "qwert asdf*" or "qwert_asdfghjkl" work as expected.

However, when a query contains a wildcard, e.g. "qwert_asdf*", the query parser does not use its analyzer's tokenizer to split the words and thus finds no results.
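
A minimal sketch of the reported behavior (this is not the attached LuceneTest.java; it assumes Lucene 6.x with lucene-core, lucene-analyzers-common and lucene-queryparser on the classpath, and the class and field names are illustrative):

    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;

    public class WildcardParseDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("f", new SimpleAnalyzer());
            // The plain term goes through the analyzer and is split at the
            // underscore into two clauses (qwert, asdfghjkl).
            System.out.println(parser.parse("qwert_asdfghjkl"));
            // The trailing-wildcard term bypasses the analyzer's tokenizer and
            // stays a single prefix term: f:qwert_asdf*
            System.out.println(parser.parse("qwert_asdf*"));
        }
    }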


Migrated from LUCENE-7437 by Michael Pichler, 1 vote. Attachments: LuceneTest.java

asfimport commented 8 years ago

Michael Pichler (migrated from JIRA)

The attached test program uses a SimpleAnalyzer (which has a LowerCaseTokenizer) to add a field value "qwert_asdfghjkl", which produces two words.

All test searches work as expected, except those where special characters (like underscore or percent sign) are used together with a wildcard (asterisk).

Without a wildcard the query is tokenized as expected (so whitespace and separator characters both work in the query); with a wildcard, however, the query is no longer tokenized, and only whitespace between the searched words/prefixes yields a search result.

search: 'qwert asdfghjkl', query: '+f:qwert +f:asdfghjkl', #hits: 1
search: 'qwert_asdfghjkl', query: '+f:qwert +f:asdfghjkl', #hits: 1
search: 'qwert%asdfghjkl', query: '+f:qwert +f:asdfghjkl', #hits: 1
search: 'qwert asdf*', query: '+f:qwert +f:asdf*', #hits: 1
search: 'qwert_asdf*', query: 'f:qwert_asdf*', #hits: 0
  ^^^ expected 1 hit(s), got 0
search: 'qwert%asdf*', query: 'f:qwert%asdf*', #hits: 0
  ^^^ expected 1 hit(s), got 0
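
The index side can be checked directly against the analyzer's token stream. A small sketch (assumes Lucene 6.x; the class name is illustrative) shows that only the tokens qwert and asdfghjkl ever reach the index, so the untokenized prefix term qwert_asdf* cannot match anything:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            try (SimpleAnalyzer analyzer = new SimpleAnalyzer();
                 TokenStream ts = analyzer.tokenStream("f", "qwert_asdfghjkl")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term); // prints "qwert", then "asdfghjkl"
                }
                ts.end();
            }
        }
    }
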
asfimport commented 8 years ago

Michael Pichler (migrated from JIRA)

This problem with underscores and wildcards seems to have existed for quite a while; see also: http://stackoverflow.com/questions/3458221/lucene-net-search-and-underscore

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi,

This is a known problem and not solvable by default, because many Analyzers do not work with wildcards, e.g. if stemming is involved. If you know that your analysis does not break the wildcard expansion, you can use AnalyzingQueryParser, a subclass of the classic query parser that does special processing of wildcards, ranges, and fuzzy queries: https://lucene.apache.org/core/6_2_0/queryparser/org/apache/lucene/queryparser/analyzing/AnalyzingQueryParser.html
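
To illustrate the stemming caveat, here is a hedged sketch (Lucene 6.x assumed; the analyzer choice is only an example): EnglishAnalyzer indexes "running" as the stemmed term "run", and wildcard expansion runs against those stemmed index terms, so a user-typed prefix such as runni* can never match, no matter how the query is tokenized.

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StemmingWildcardDemo {
        public static void main(String[] args) throws Exception {
            try (EnglishAnalyzer analyzer = new EnglishAnalyzer();
                 TokenStream ts = analyzer.tokenStream("f", "running")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints "run": the stemmed form is what ends up in the index,
                    // so a prefix query for "runni" cannot match it.
                    System.out.println(term);
                }
                ts.end();
            }
        }
    }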

asfimport commented 8 years ago

Michael Pichler (migrated from JIRA)

Hello Uwe, thanks for the quick reply!

As we do not use stemming at indexing time, this would be a suitable option for us (in fact, we already override getPrefixQuery and getFuzzyQuery in our custom QueryParser, as is done in AnalyzingQueryParser).

I tried AnalyzingQueryParser in the test program and got the same result (hence I reopened the issue). Even though the analyzer is involved for normalization, it is still not used to tokenize the input.

    // QueryParser queryParser = new QueryParser(FIELD_NAME, analyzer);
    QueryParser queryParser = new AnalyzingQueryParser(FIELD_NAME, analyzer);
    System.err.println("using QueryParser " + queryParser.getClass());
using QueryParser class org.apache.lucene.queryparser.analyzing.AnalyzingQueryParser
search: 'qwert asdf*', query: '+f:qwert +f:asdf*', #hits: 1
search: 'qwert_asdf*', query: 'f:qwert_asdf*', #hits: 0
  ^^^ expected 1 hit(s), got 0
search: 'qwert%asdf*', query: 'f:qwert%asdf*', #hits: 0
  ^^^ expected 1 hit(s), got 0
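
A possible workaround along the lines Michael describes is sketched below. This is not code from the issue or its attachment; the class name and the separator pattern are illustrative, and it only covers the trailing-* case, which the classic grammar treats as a prefix term and routes to getPrefixQuery.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Hypothetical workaround: split a prefix term at the separators the analyzer
    // discards (here: any non-letter, mirroring SimpleAnalyzer's LowerCaseTokenizer),
    // then require the leading parts as terms and the last part as a prefix.
    public class SeparatorSplittingQueryParser extends QueryParser {

        public SeparatorSplittingQueryParser(String field, Analyzer analyzer) {
            super(field, analyzer);
        }

        @Override
        protected Query getPrefixQuery(String field, String termStr) throws ParseException {
            // termStr arrives without the trailing '*', e.g. "qwert_asdf";
            // edge cases (leading separators, escapes) are ignored in this sketch.
            String[] parts = termStr.toLowerCase().split("[^\\p{L}]+");
            if (parts.length <= 1) {
                return super.getPrefixQuery(field, termStr);
            }
            BooleanQuery.Builder bq = new BooleanQuery.Builder();
            for (int i = 0; i < parts.length - 1; i++) {
                bq.add(new TermQuery(new Term(field, parts[i])), BooleanClause.Occur.MUST);
            }
            bq.add(new PrefixQuery(new Term(field, parts[parts.length - 1])), BooleanClause.Occur.MUST);
            return bq.build();
        }
    }

With such an override in place, parsing 'qwert_asdf*' should produce something like '+f:qwert +f:asdf*', which matches the two tokens SimpleAnalyzer put into the index.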