Open easyice opened 8 months ago
I'm not sure I like this idea, which feels quite arbitrary: why would there be a big gap of matching doc IDs towards the start of the doc ID space, and not in the middle or towards the end?
This BitSetConjunctionDISI
was mostly added to mimic the way FilteredQuery
used to work (before we replaced filters with queries and FilteredQuery
with FILTER
clauses on BooleanQuery
) to avoid performance regressions. Maybe we can improve the heuristic regarding when to use it, which could in-turn help the case you're looking into? For instance we currently evaluate filters via Bits#get
rather than BitSet#nextSetBit
when their cost is greater than the minimum cost across all required clauses. In a similar context, IndexOrDocValuesQuery
only uses doc values to run queries when their cost is 8x greater than the lead cost, maybe it would be a better threshold for evaluating filters via Bits#get
as well?
@jpountz Thanks very much for the guidance! Your explanation makes sense to me, I'll try this idea :)
Description
In
BitSetConjunctionDISI
, we iterate over theDocIdSetIterator
of lead, and then lookup the BitSets to get conjunction. currently, we usebitSet.length()
to limit the termination of the iteration, the similarly, maybe we can consider usebitSet.nextSetBit(0)
to set the start doc of the iteration. some scenarios may have more zeros at the beginning of the BitSet, for instance, the retention query in Elasticsearch using an increasing number field(_seq_no), it may save the result to BitSet with more leading zeros.