apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Advance to first position of 1 in BitSet before iterating the lead in BitSetConjunctionDISI? #13024

Open easyice opened 8 months ago

easyice commented 8 months ago

Description

In BitSetConjunctionDISI, we iterate over the DocIdSetIterator of lead, and then lookup the BitSets to get conjunction. currently, we use bitSet.length() to limit the termination of the iteration, the similarly, maybe we can consider use bitSet.nextSetBit(0) to set the start doc of the iteration. some scenarios may have more zeros at the beginning of the BitSet, for instance, the retention query in Elasticsearch using an increasing number field(_seq_no), it may save the result to BitSet with more leading zeros.

jpountz commented 7 months ago

I'm not sure I like this idea, which feels quite arbitrary: why would there be a big gap of matching doc IDs towards the start of the doc ID space, and not in the middle or towards the end?

This BitSetConjunctionDISI was mostly added to mimic the way FilteredQuery used to work (before we replaced filters with queries and FilteredQuery with FILTER clauses on BooleanQuery) to avoid performance regressions. Maybe we can improve the heuristic regarding when to use it, which could in-turn help the case you're looking into? For instance we currently evaluate filters via Bits#get rather than BitSet#nextSetBit when their cost is greater than the minimum cost across all required clauses. In a similar context, IndexOrDocValuesQuery only uses doc values to run queries when their cost is 8x greater than the lead cost, maybe it would be a better threshold for evaluating filters via Bits#get as well?

easyice commented 7 months ago

@jpountz Thanks very much for the guidance! Your explanation makes sense to me, I'll try this idea :)