apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

Maybe a DocIdSetIterator may implement Bits? [LUCENE-8386] #9433

Open asfimport opened 6 years ago

asfimport commented 6 years ago

I was looking at ConjunctionDISI and noted the special case logic for DISI's of type BitSetIterator. It seems to only need the more minimal Bits interface though it makes references to BitSet specifically. BitSetIterator is a concrete class; it would be nice if a DISI could either implement an optional interface to expose a Bits or perhaps implements Bits directly. This would allow other/custom DISIs that can implement a Bits quickly without being forced to use BitSetIterator specifically. Even DocIdSetIterator.all(...) could implement this.


Migrated from LUCENE-8386 by David Smiley (@dsmiley), updated Jul 09 2018

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Do you have concrete ideas in mind? I see BitSetIterator as an exception, other Bits-based iterators don't perform well and are better implemented with two-phase iterators?

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

First a disclaimer: I was merely looking at the code; this wasn't driven by any debugging session or profiling.

I see BitSetIterator as an exception, other Bits-based iterators don't perform well

I'm confused. Why would BitSetIterator specifically be better than some other hypothetical Bits-based iterator? A hypothetical Bits-based iterator of course would be based on some fast bit set, though through API reasons isn't necessarily a BitSetIterator specifically.

Bits-based iterators don't perform well and are better implemented with two-phase iterators

That surprises me; I would imagine it would depend on the cardinality of the bits since a low-cardinality bit set ought to potentially lead iteration? For an extremely dense case; I'm not sure how much it'd matter either way since they are cheap to work with.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Sorry for introducing some confusion, when I mentioned Bits-based iterators I was thinking of a DocIdSetIterator over a Bits instance that would check bits one by one to find the next match rather than an alternative to BitsetIterator.

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I think I see what you're getting at due to my first mention of Bits vs BitSet; maybe we confused each other ;) 

I'm pointing that ConjunctionDISI makes a special optimization for any of it's input DISI's of type BitSetIterator, and I think that's a shame since someone might have a similar DISI that is not necessarily a BitSetIterator precisely.  From that observation, I hypothesized if a DISI might expose an optional Bits somehow, then ConjunctionDISI could do it's optimization more generically.