I stumbled across this one in a real-life application, where matches-API based highlighting of a query like this:
field:(a OR b OR c OR d OR ...)
took a very long time to complete, even though query execution itself is blazing fast. The reason is (I think!) how MultiTermQuery handles matches: AbstractMultiTermQueryConstantScoreWrapper returns a disjunction of iterators built from a terms enum:
    @Override
    public Matches matches(LeafReaderContext context, int doc) throws IOException {
      final Terms terms = context.reader().terms(q.field);
      if (terms == null) {
        return null;
      }
      return MatchesUtils.forField(
          q.field,
          () ->
              DisjunctionMatchesIterator.fromTermsEnum(
                  context, doc, q, q.field, q.getTermsEnum(terms)));
    }
but for a large set of alternatives, the scan loop inside fromTermsEnum can take a long time before it hits a term that actually occurs in the target document, because each term is looked up and its postings advanced in turn:
    static MatchesIterator fromTermsEnum(
        LeafReaderContext context, int doc, Query query, String field, BytesRefIterator terms)
        throws IOException {
      Objects.requireNonNull(field);
      Terms t = Terms.getTerms(context.reader(), field);
      TermsEnum te = t.iterator();
      PostingsEnum reuse = null;
      for (BytesRef term = terms.next(); term != null; term = terms.next()) {
        if (te.seekExact(term)) {
          PostingsEnum pe = te.postings(reuse, PostingsEnum.OFFSETS);
          if (pe.advance(doc) == doc) {
            return new TermsEnumDisjunctionMatchesIterator(
                new TermMatchesIterator(query, pe), terms, te, doc, query);
          } else {
            reuse = pe;
          }
        }
      }
      return null;
    }
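To make the cost of that loop concrete, here is a toy model of its shape. It is only a sketch and does not use Lucene: plain Java collections stand in for TermsEnum and PostingsEnum, and the class and method names (MatchesScanModel, termsProbed) are made up for illustration. It counts how many terms get probed before the first one that contains the target document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only: plain Java collections stand in for Lucene's TermsEnum and PostingsEnum.
class MatchesScanModel {

    // Mirrors the shape of the loop in fromTermsEnum: visit the terms in order and
    // stop at the first one whose postings contain doc. Returns the number of terms probed.
    static int termsProbed(List<TreeSet<Integer>> postingsPerTerm, int doc) {
        int probed = 0;
        for (TreeSet<Integer> postings : postingsPerTerm) {
            probed++;                                  // stands in for seekExact + postings
            Integer advanced = postings.ceiling(doc);  // stands in for pe.advance(doc)
            if (advanced != null && advanced == doc) {
                return probed;
            }
        }
        return probed; // no term matched: every term was probed
    }

    public static void main(String[] args) {
        int termCount = 10_000;
        List<TreeSet<Integer>> postingsPerTerm = new ArrayList<>();
        for (int i = 0; i < termCount; i++) {
            TreeSet<Integer> postings = new TreeSet<>();
            postings.add(i); // term i occurs only in document i
            postingsPerTerm.add(postings);
        }
        System.out.println(termsProbed(postingsPerTerm, 0));             // 1
        System.out.println(termsProbed(postingsPerTerm, termCount - 1)); // 10000
    }
}
```

In this model a document that only contains the last term (or none of the terms) costs one probe per term, and that cost is paid again for every document being highlighted. Real postings and term dictionaries behave differently in the details, so treat the numbers as an illustration of the scaling rather than a measurement.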
I've no idea what the fix can be here, just mentioning the problem before I forget it.
Perhaps this wasn't clear: the important bit here is the use of TermInSetQuery (the query parser substitutes this type of query for large boolean expressions to prevent max-boolean-clauses-exceeded errors).