apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

qweight.matches(LeafReaderContext ctx, int doc) can be prohibitively slow for large TermInSet queries #13391

Open dweiss opened 1 month ago

dweiss commented 1 month ago

Description

I stumbled across this one in a real-life application, where matches-API based highlighting of a query like this:

field:(a OR b OR c OR d OR ...)

took a very long time to complete, even though query execution itself is blazing fast. The reason (I think!) is how MultiTermQuery handles matches: AbstractMultiTermQueryConstantScoreWrapper returns a disjunction of iterators built from a terms enum:

    @Override
    public Matches matches(LeafReaderContext context, int doc) throws IOException {
      final Terms terms = context.reader().terms(q.field);
      if (terms == null) {
        return null;
      }
      return MatchesUtils.forField(
          q.field,
          () ->
              DisjunctionMatchesIterator.fromTermsEnum(
                  context, doc, q, q.field, q.getTermsEnum(terms)));
    }

but for a large set of alternatives, the scan loop inside fromTermsEnum can take a long time before it reaches a term whose postings actually contain the requested document:

  static MatchesIterator fromTermsEnum(
      LeafReaderContext context, int doc, Query query, String field, BytesRefIterator terms)
      throws IOException {
    Objects.requireNonNull(field);
    Terms t = Terms.getTerms(context.reader(), field);
    TermsEnum te = t.iterator();
    PostingsEnum reuse = null;
    for (BytesRef term = terms.next(); term != null; term = terms.next()) {
      if (te.seekExact(term)) {
        PostingsEnum pe = te.postings(reuse, PostingsEnum.OFFSETS);
        if (pe.advance(doc) == doc) {
          return new TermsEnumDisjunctionMatchesIterator(
              new TermMatchesIterator(query, pe), terms, te, doc, query);
        } else {
          reuse = pe;
        }
      }
    }
    return null;
  }
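
To make the cost concrete, here is a toy model of that loop. It is only a sketch and does not use Lucene: sorted int arrays stand in for postings, a map lookup stands in for te.seekExact(term), and a binary search stands in for pe.advance(doc) == doc. The names MatchesScanSketch and probesUntilMatch are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: postings are sorted int[] doc-ID arrays, not Lucene PostingsEnum objects. */
final class MatchesScanSketch {

  /**
   * Mimics the shape of the fromTermsEnum loop: visit the query terms in order and
   * return how many terms were visited before one contained {@code doc}.
   * Returns -1 if no term matches the document.
   */
  static int probesUntilMatch(Map<String, int[]> postings, Iterable<String> queryTerms, int doc) {
    int probes = 0;
    for (String term : queryTerms) {
      probes++;
      int[] docs = postings.get(term); // stands in for te.seekExact(term)
      if (docs == null) {
        continue; // term is absent from this (hypothetical) segment
      }
      if (Arrays.binarySearch(docs, doc) >= 0) { // stands in for pe.advance(doc) == doc
        return probes;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    int termCount = 10_000;
    Map<String, int[]> postings = new LinkedHashMap<>();
    for (int i = 0; i < termCount; i++) {
      postings.put("t" + i, new int[] {i}); // term t<i> occurs only in document i
    }
    // The document matched only by the last term forces a visit to every term.
    System.out.println(probesUntilMatch(postings, postings.keySet(), termCount - 1)); // prints 10000
    // The document matched by the first term returns after a single visit.
    System.out.println(probesUntilMatch(postings, postings.keySet(), 0)); // prints 1
  }
}
```

In this model the work per call grows linearly with the position of the first matching term. Since matches is called once per document, a highlighter pays that scan again for every hit it processes.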

I've no idea what the fix can be here, just mentioning the problem before I forget it.

dweiss commented 1 month ago

Perhaps this wasn't clear - the important bit here is the use of TermInSetQuery (the query parser substitutes this type of query for large boolean expressions to prevent max-boolean-clauses-exceeded errors).