make it easier to plugin different bitset implementations to CachingWrapperFilter [LUCENE-5101]

asfimport commented 11 years ago

Currently this is possible, but its not so friendly:

  protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
    if (docIdSet == null) {
      // this is better than returning null, as the nonnull result can be cached
      return EMPTY_DOCIDSET;
    } else if (docIdSet.isCacheable()) {
      return docIdSet;
    } else {
      final DocIdSetIterator it = docIdSet.iterator();
      // null is allowed to be returned by iterator(),
      // in this case we wrap with the sentinel set,
      // which is cacheable.
      if (it == null) {
        return EMPTY_DOCIDSET;
      } else {
/* INTERESTING PART */
        final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
        bits.or(it);
        return bits;
/* END INTERESTING PART */
      }
    }
  }

Is there any value to having all this other logic in the protected API? It seems like something thats not useful for a subclass... Maybe this stuff can become final, and "INTERESTING PART" calls a simpler method, something like:

protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
  final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
  bits.or(iterator);
  return bits;
}

Migrated from LUCENE-5101 by Robert Muir (@rmuir), 2 votes, resolved Sep 06 2013 Attachments: DocIdSetBenchmark.java, LUCENE-5101.patch (versions: 4)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I suppose taking DocIDSet is more flexible, in case you can build it in a faster way. But it would be good to make this simpler (less boilerplate) without losing the flexibility.

Maybe a bigger change is needed: we have quite a few impls now and CWF is messy: for example I think EMPTY_DOCIDSET should not exist here in CWF, but instead be DocIDSet.EMPTY...

Anyway this is just to hopefully encourage discussion about how we can make it easier to use the new impls :)

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 The two new impls at least can be constructed from a DocIdSetIterator.

Maybe we should have another protected method for what to do in case the doc ID set returned by the filter is already cacheable (by default it would return the set directly). Some filters return doc id sets which are already cacheable (TermsFilter for example) and it could be useful to make sure a CWF always returns the same impl (for instance block joins want a FixedBitSet).

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

A quick note about alternative DocIdSet we now have. I wrote a benchmark (attached) to see how they compared to FixedBitSet, you can look at the results here: http://people.apache.org/\~jpountz/doc_id_sets.html Please note that EliasFanoDocIdSet is disadvantaged for advance() since it doesn't have an index yet, it will be interesting to run this benchmark again when it gets one.

Maybe we could use these numbers to have better defaults in CWF? (and only use FixedBitSet for dense sets for example)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe we could use these numbers to have better defaults in CWF? (and only use FixedBitSet for dense sets for example)

+1: we should have better defaults. Ideally we would use DISI.cost() to estimate the sparsity.

One problem is a lot of the costly filters that people want to cache have a crap cost() implementation. e.g. MultiTermQueryWrapperFilter could instead getAndSet() and return a DISI with an actual accurate cost().

Or instead for now, we could also check firstDocID too...

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I was thinking about a heuristic like this...

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

I wrote a benchmark (attached) to see how they compared to FixedBitSet, you can look at the results here: http://people.apache.org/\~jpountz/doc_id_sets.html

Thank you very much, plenty of dilemma's ahead. Do WAH8 and PFOR already have an index? With an index, each of these, including Elias-Fano, should have about constant access time when advancing far enough. What that constant time will be is still open.

Block decoding might still be added to EliasFano, which should improve its nextDoc() performance, but I have no idea by how much. See also at #3824 for Kamikaze PFOR. The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR.

Another surpise is that Elias-Fano is best at advance() among the compressed sets for some cases. That means that Long.bitCount() is doing well on the upper bits then.

For bit densities > 1/2 there is clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Do WAH8 and PFOR already have an index?

They do, but the index is naive: it is a plain binary search over a subset of the (docID,position) pairs contained in the set. With the first versions of these DocIdSets, I just wanted to guarantee O(log(cardinality)) advance performance.

Block decoding might still be added to EliasFano, which should improve its nextDoc() performance

The main use-case I see for these sets is to be used as filters. So I think advance() performance is more important?

The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR.

Well, the PFOR doc ID set is not tuned either. :-) But I agree this is a good surprise for the Elias-Fano set. I mean even the WAH8 doc id set should be pretty fast and is still slower than the Elias-Fano set.

Another surprise is that Elias-Fano is best at advance() among the compressed sets for some cases. That means that Long.bitCount() is doing well on the upper bits then.

I'm looking forward for the index. :-)

For bit densities > 1/2 there is clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?

I guess so.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

I had another look at the recent benchmark results and something does not seem in order there.

At density -2 (1%), Elias-Fano is faster at advance(docID() +1) (2.45 times fixed) than at nextDoc() (1.81 times fixed), and I would expect that FixedBitSet would have an almost equal run times for advance(docId()+1) and nextDoc().

The code for advance (advanceToValue in EliasFanoDecoder) is really more complex than the code for nextDoc (nextValue in EliasFanoDecoder) and the code at EliasFanoDocIdSet is so simple that it should not really influence things here. So for EliasFanoDocIdSet advance(docId() + 1) should normally be slower than nextDoc(), but the benchmark contradicts this.

Could there be a mistake in the benchmark for these cases? Or is this within expected (JIT) tolerances?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Well spotted. Maybe I did a mistake when moving the data from the benchmark output to the charts. I modified the program so that it outputs directly the input of the charts. See the updated charts at http://people.apache.org/\~jpountz/doc_id_sets.html. I also modified it so that memory uses a log scale too.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

I simplified the benchmark somewhat and ran it on my 32 bit home machine.

The relative results for EliasFanoDocIdSet are very similar to the ones above, so this benchmark is a good measuring tool for me. I'll use it to verify performance of an index once it is working, #6173.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I just updated http://people.apache.org/\_{jpountz/doc_id_sets.html based on the latest updated on WAH8DocIdSet and PFORDeltaDocIdSet and added the building time to the charts. PFORDeltaDocIdSet now better compresses doc id sets of medium load factors (\} 0.7) by switching to unary coding when it saves space compared to PFor. This way, it never grows much larger than a FixedBitSet. Similarly, WAH8DocIdSet now better compresses dense sets (#6214).

I wish I had more time to work on the EliasFanoDocIdSet to add an index to it and see how it behaves. It has interesting characteristics!

Regarding this issue, I think we could update the patch to always use WAH8DocIdSet? I think the most interesting benchmark is advance(31) since it maps to a Scorer that performs leap frog between a query and a filter that both contain lots of documents (hence the query execution is slow) and WAH8DocIdSet is the one which has the best worst case here?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think that would be a reasonable default.

I still think it would be good to improve the API in a way that makes it easier for someone to add their own heuristics/logic/subclasses and use different implementations. Its just unfortunate there is some boilerplate and cornercases they have to worry about (null DocIDSet, null Iterator, etc etc).

The way the patch does it is just one possibility (we could also just move forward with a better default and not change the API, too).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch that just defaults to compressing but with the same protected method. I also added some tests to TestCWF (for the corner cases of the boilerplate stuff).

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Patch looks good to me. Since join users are going to need fixed bit sets, maybe we should have a FixedBitSetCachingWrapperFilter so that they don't need to instantiate an anynous class to override cacheImpl every time?

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Patch looks good to me, too.

Hopefully we'll get some early feedback about performance.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I iterated on Robert's patch to fix failures in the join module which were due to the fact that CWF doesn't always produce FixedBitSets anymore.

I added a new FixedBitSetCachingWrapperFilter in oal.search.join which extends CWF to always produce FixedBitSets, as required by To(Parent|Child)BlockJoinQuery.

All tests passed.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

did you forget to 'svn add'?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Indeed! Here is a complete patch.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can the FixedBitSetCachingWrapperFilter.java just override cacheImpl instead of all the boilerplate? :)

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This CWF is a bit different in that if it gets a cacheable filter which is not a FixedBitSet, it will convert it to a FixedBitSet instead of keeping it as-is. I think this is required for instance if users have custom filter implementations that return cacheable filters which are not a FixedBitSet (some filters return cacheable doc id sets, for instance TermsFilter could be backed by something else than a FixedBitSet in the future).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

ok, makes sense: thanks. maybe we could add this to the docs or in a code comment?

patch looks good to me.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Good point, here is an updated patch with an additional comment explaining the difference with CWF.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1520525 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1520525

LUCENE-5101: Make it easier to plugin different bitset implementations to CachingWrapperFilter.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1520527 from @jpountz in branch 'dev/branches/branch_4x' https://svn.apache.org/r1520527

LUCENE-5101: Make it easier to plugin different bitset implementations to CachingWrapperFilter.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Committed, thanks Robert!

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

4.5 release -> bulk close

apache / lucene

make it easier to plugin different bitset implementations to CachingWrapperFilter [LUCENE-5101] #6165