apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Context query with regex "." produces an assertion failure [LUCENE-8288] #9335

Open asfimport opened 6 years ago

asfimport commented 6 years ago

When a RegexCompletionQuery of "." is provided to ContextQuery, the following assertion failure occurs:

java.lang.AssertionError: input should not end with a context separator followed by SEP_LABEL

at org.apache.lucene.search.suggest.document.ContextQuery$ContextCompletionWeight.setInnerWeight(ContextQuery.java:299)
at org.apache.lucene.search.suggest.document.ContextQuery$ContextCompletionWeight.setNextMatch(ContextQuery.java:275)
at org.apache.lucene.search.suggest.document.NRTSuggester.lookup(NRTSuggester.java:221)
at org.apache.lucene.search.suggest.document.CompletionScorer.score(CompletionScorer.java:70)
at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
at org.apache.lucene.search.suggest.document.SuggestIndexSearcher.suggest(SuggestIndexSearcher.java:78)
at org.apache.lucene.search.suggest.document.SuggestIndexSearcher.suggest(SuggestIndexSearcher.java:58)
at org.apache.lucene.search.suggest.document.TestContextQuery.testDotRegexQuery(TestContextQuery.java:188)

Note that this is a related, but distinct issue from #9334, where the RegexCompletionQuery is empty.

The attached patch provides a reproduction of the issue, as the test case TestContextQuery#testRegexDotQuery. To reproduce, Java assertions must be enabled (as in the default configuration for tests).

The patch also provides a test case for the normal behavior of an empty RegexCompletionQuery, when it is not wrapped in ContextQuery (TestRegexCompletionQuery#testRegexDotQuery). In this case, there is no error, and all suggestions are returned.

From a quick look, it seems as though "." doesn't capture any characters past  CompletionAnalyzer.SEP_LABEL, so the matching prefix in ContextCompletionWeight#setInnerWeight is unexpectedly empty.


Migrated from LUCENE-8288 by Julie Tibshirani (@jtibshirani), updated May 07 2018 Attachments: LUCENE-8288.patch (versions: 2), LUCENE-8288-repro.patch

asfimport commented 6 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Since it is possible to index suggestions with or without separators (preservePositionIncrements) the context query adds an optional separator after the context automaton. This character is optional so the regex "." can match the context plus the separator label but nothing from the real suggestions. Completion queries should always match a prefix from the suggestions (hence the assertion) but it doesn't handle regex that starts with ".". I've attached a patch to fix the issue that adds a parameter in the ContextQuery constructor to indicate if suggestions are indexed with position increments or not.This is a breaking change since it requires to match the value used for indexing but I don't see how to do it differently if we want to match regex that starts with any character accurately (e.g.: ".[s|t]").

asfimport commented 6 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Thank you @jimczi! I'm looking forward to this fix going in, as we're hoping to use a "." regex completion query as part of an elasticsearch enhancement (https://github.com/elastic/elasticsearch/issues/19147),).

asfimport commented 6 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I attached a new patch that fixes the name of the option (preserveSep is the correct option, preservePositionIncrements is for keeping holes when tokens are removed by a filter). I'll commit next week if there is no objection.