Closed asfimport closed 10 years ago
Paul Elschot (migrated from JIRA)
First patch, 17 Oct 2013, quite rough, one nocommit. The latest benchmark results for doc id sets are here: http://people.apache.org/~jpountz/doc_id_sets.html
The patch uses EliasFanoDocIdSet for caching when EliasFanoDocIdSet.sufficientlySmallerThanBitSet returns true, which is currently when the load factor is at most 1/7 (about -0.85 on the log10 scale in the benchmark results). Otherwise it uses WAH8DocIdSet, the current behaviour. Does this choice make good use of the benchmark results?
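That decision can be sketched as follows. This is a hedged standalone illustration based on the description above, not the actual Lucene code: the 1/7 threshold and the name sufficientlySmallerThanBitSet come from the patch description, and the real method's signature may differ.

```java
// Illustrative sketch of the caching decision: prefer Elias-Fano when
// the set is sparse enough relative to a bit set, otherwise WAH8.
class CacheImplChoice {

    // True when the load factor numDocs/maxDoc is at most 1/7,
    // mirroring the threshold described in the patch.
    static boolean sufficientlySmallerThanBitSet(int numDocs, int maxDoc) {
        return (long) numDocs * 7 <= maxDoc;
    }

    static String chooseImpl(int numDocs, int maxDoc) {
        return sufficientlySmallerThanBitSet(numDocs, maxDoc)
            ? "EliasFanoDocIdSet"
            : "WAH8DocIdSet";
    }
}
```

For example, 100 docs out of 1000 (load factor 0.1 <= 1/7) would pick Elias-Fano, while 500 out of 1000 would keep WAH8.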
To get the number of doc ids to be put in the cache, the patch checks the type of the actual DocIdSet that is given and uses the FixedBitSet and OpenBitSet cardinality() methods. (Perhaps a similar method should be added to EliasFanoDocIdSet.) In other cases, the patch falls back to WAH8DocIdSet.
I added a DocIdSet argument to cacheImpl(); there is a nocommit for that.
The patch also corrects a mistake in EliasFanoDocIdSet.sufficientlySmallerThanBitSet, the arguments should be int instead of long, just like the EliasFanoDocIdSet constructor.
Adrien Grand (@jpountz) (migrated from JIRA)
I like the idea of using the Elias-Fano doc id set given how it behaves in the benchmarks, but it is tricky that it needs to know the size of the set in advance. In practice, the cache impls that you are going to have in CWF.cacheImpl are most likely QueryWrapperFilters, not FixedBitSets or OpenBitSets, so there is no way to know the exact size in advance. We could use DocIdSetIterator.cost, but although it is recommended to implement this method by returning an upper bound on the number of documents in the set, it could return any number. Do you think there would be a way to relax the Elias-Fano doc id set building process so that it could be built by providing an approximation of the number of docs in the set (at the cost of some compression loss)?
Paul Elschot (migrated from JIRA)
relax the Elias-Fano doc id set building process so that it could be built by providing an approximation of the number of docs in the set (at the cost of some compression loss)
The approximation would have to be an upper bound, i.e. it might be higher than the actual number of documents. The EliasFanoEncoder reserves all the memory it needs at construction time, so the loss in compression would be roughly as noticeable as the inaccuracy of the bound. DocIdSetIterator.cost has another purpose, so it's not worthwhile to use it here I think.
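To see why the bound's accuracy translates directly into compression loss: the allocation is proportional to the numValues the encoder is given. A small helper illustrating the usual Elias-Fano size arithmetic; this is an assumption-laden sketch, not Lucene code (the real encoder also allocates index space):

```java
class EfSizeEstimate {

    // Approximate size in bits of an Elias-Fano sequence of numValues
    // values bounded by upperBound: about 2 upper bits per value plus
    // floor(log2(upperBound / numValues)) low bits per value (never
    // negative). Since memory is reserved for numValues up front, an
    // overestimated numValues inflates the allocation proportionally.
    static long estimatedBits(long numValues, long upperBound) {
        if (numValues <= 0) {
            return 0;
        }
        long ratio = upperBound / numValues;
        long lowBits = (ratio > 1) ? (63 - Long.numberOfLeadingZeros(ratio)) : 0;
        return numValues * (2 + lowBits);
    }
}
```

For 1000 values bounded by 1,000,000 this gives 1000 * (2 + 9) = 11000 bits; reserving room for 2000 values instead gives 2000 * (2 + 8) = 20000 bits, so a 2x overestimate nearly doubles the size.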
Does the faster build time of the EF DocIdSet (compared to WAH8 and FBS) allow for an extra FBS to be built? That is not immediately clear from the benchmark results, but it could be so.
Adrien Grand (@jpountz) (migrated from JIRA)
DocIdSetIterator.cost has another purpose, so it's not worthwhile to use it here I think.
I was mentioning it because it is the closest thing we have to a cardinality() which is available for every DocIdSet.
Does the faster build time of the EF DocIdSet (compared to WAH8 and FBS) allow for an extra FBS to be built?
That is not immediately clear from the benchmark results, but it could be so.
My guess is that it is faster to build the in-memory doc id sets than to consume the (e.g. QueryWrapper) filter we need to cache, so indeed, it may allow for building an additional FBS. Since WAH8DocIdSet computes its size (available through cardinality()), maybe we could build a WAH8DocIdSet in any case and replace it with an EF doc id set when there are not many documents? I can try to update the benchmarks to add the building of an additional FBS before the EF doc id set. This also reminds me that I should look into the building time of the WAH8 doc id set; there are probably things to improve... I'm currently thinking it may be due to the fact that it keeps resizing buffers, but I may be completely wrong.
Paul Elschot (migrated from JIRA)
... the closest thing we have to a cardinality() which is available for every DocIdSet.
For a single term used as a filter there is IndexReader.docFreq(Term), and that does fit in here. (Ideally in this case the posting list could be copied from the index into the cache, but we're not there yet.)
Shall I add a check for QueryWrapperFilter.getQuery() being a TermQuery and then use docFreq() ?
There is also a TermFilter in the queries module. I have not looked at the code yet. Is it necessary to move that to core so a check for that can be used here, too?
... maybe we could build a WAH8DocIdSet in any case and replace it with an EF doc id set when there are not many documents?
For any case in which the cardinality cannot easily be determined, that would indeed make sense given the benchmark.
I can try to update the benchmarks to add the building of an additional FBS before the EF doc id set.
Adding a single FBS build time to the EF DocIdSet build time can be visualized in the benchmark for the build times. In that case the EF DocIdSet build result never rises above log(1) = 0, so an update of the benchmarks would not be needed.
Adrien Grand (@jpountz) (migrated from JIRA)
Shall I add a check for QueryWrapperFilter.getQuery() being a TermQuery and then use docFreq() ?
I don't much like having support for specific filters based on instanceof calls. I really think the only two options are to consume the filter to cache twice, or to first load it into memory in another filter impl and then load it again into an EF doc id set. And I would go for option 2, since option 1 is likely going to be slower?
Paul Elschot (migrated from JIRA)
I don't like much having support for specific filters based on instanceof calls.
Me neither, but that can be fixed by adding a docFreq() method to DocIdSet, as another hint to be used by CachingWrapperFilter. This method should return the actual number of doc ids in the set when its return value is >= 0.
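A minimal sketch of that proposed hint, assuming a default of -1 for "unknown". HintedDocIdSet and KnownSizeDocIdSet are stand-in names for illustration, not the real Lucene DocIdSet:

```java
// Stand-in for Lucene's DocIdSet, to illustrate the proposed hint only.
abstract class HintedDocIdSet {

    // The proposed hint: the exact number of doc ids in the set when the
    // return value is >= 0, or -1 when the implementation does not know.
    int docFreq() {
        return -1; // default: no hint available
    }
}

// An implementation that knows its cardinality overrides the hint, letting
// a caching wrapper pick a set implementation up front without a first pass.
class KnownSizeDocIdSet extends HintedDocIdSet {
    private final int size;

    KnownSizeDocIdSet(int size) {
        this.size = size;
    }

    @Override
    int docFreq() {
        return size;
    }
}
```

A wrapper could then build an Elias-Fano set directly when docFreq() >= 0 and fall back to the two-pass route otherwise.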
The existing hint in DocIdSet is isCacheable(); its javadocs will need to be updated since they need not mention BitSet anymore.
I also prefer option 2, but I'd like to avoid using the other filter impl when possible by using the docFreq hint.
I'll try and make another patch for this.
Adrien Grand (@jpountz) (migrated from JIRA)
The existing hint in DocIdSet is isCacheable(); its javadocs will need to be updated since they need not mention BitSet anymore.
Good point, I fixed it.
I also prefer option 2, but I'd like to avoid using the other filter impl when possible by using the docFreq hint.
To me this feels like a big change compared to what it gives us. I would prefer having another copy rather than adding this method.
Paul Elschot (migrated from JIRA)
It felt like a big change when I started, but it was easier than I thought; have a look at the patch of 18 Oct. There are about 35 other places that directly extend DocIdSet or create a new one from an inline subclass; I have not checked these yet.
This passes the current TestCachingWrapperFilter, but there are no tests for this change yet.
For small segments, maxDoc() <= 256, this will use WAH8; would FBS be better for those cases?
The last choice, whether to use EF after the WAH8 was built, is made using sufficientlySmallerThanBitSet because that was available, but I'm not really sure whether a smaller load factor should be used there.
Paul Elschot (migrated from JIRA)
Looking again at the benchmark on how to build an EF docidset without knowing the number of values in advance, one solution would be to use a PFD docidset for that, because it builds quickly and has good next() performance. next() would be used for a single pass through the set to build the final docidset to be cached.
However an even better way might be to use one or more temporary long arrays to store the incoming doc ids directly in FOR format, (without forming deltas and without an index). This can be done because the maximum doc id value is known. While storing the doc ids, one can switch to an FBS on the fly when the total number of doc ids becomes too high. The existing PackedInts code should be a nice fit for this. Since allocating the long arrays takes time, one can start with one array of say 1/512 of the maximum needed size, and continue into another (bigger) array as long as necessary or until an FBS is preferable.
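A rough sketch of that grow-or-switch idea, with a plain int[] and java.util.BitSet standing in for the PackedInts-based storage and FBS; the class name, the starting fraction, and the 32-bits-per-id switch threshold are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative buffer for incoming doc ids: store them sparsely while the
// set is small, growing the array as needed, and switch to a dense bit set
// on the fly once the stored ids would cost more bits than the bit set.
class GrowOrSwitchBuffer {
    private final int maxDoc;
    private int[] ids;   // sparse storage while the set is small
    private int count = 0;
    private BitSet bits; // dense fallback, null until the switch happens

    GrowOrSwitchBuffer(int maxDoc) {
        this.maxDoc = maxDoc;
        // Start at a fraction of the maximum needed size, as suggested above.
        this.ids = new int[Math.max(1, maxDoc / 512)];
    }

    void add(int docId) {
        if (bits != null) {
            bits.set(docId);
            return;
        }
        if (count == ids.length) {
            // At 32 bits per stored id, prefer the bit set once the ids
            // would occupy at least maxDoc bits, i.e. the set got dense.
            if ((long) (count + 1) * 32 >= maxDoc) {
                bits = new BitSet(maxDoc);
                for (int i = 0; i < count; i++) {
                    bits.set(ids[i]);
                }
                bits.set(docId);
                ids = null;
                return;
            }
            ids = Arrays.copyOf(ids, ids.length * 2);
        }
        ids[count++] = docId;
    }

    boolean usesBitSet() {
        return bits != null;
    }

    int cardinality() {
        return bits != null ? bits.cardinality() : count;
    }
}
```

The real version would store deltas-free ids in FOR format via PackedInts rather than raw ints, but the grow-then-switch control flow is the same.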
Paul Elschot (migrated from JIRA)
After some more thought on this I think using the WAH8 docidset as in the patch is the best solution for now, because I think that gives the best building time for the expected cases.
I might add an EliasFanoEncoder constructor with only an upperBound argument for this case. This would leave some room for adding more values (as in ArrayUtil.grow) and it would reorganize the encoded sequence to always use the latest number of values. Reorganizing the encoded sequence would be needed when the number of bits for encoding the lower values changes, and this is floor(log2(upperBound/numValues)) but never negative.
(In a docidset for filtering the upperBound is normally the segment size, and the values are the doc ids.)
Paul Elschot (migrated from JIRA)
About the patch of 20 Oct 2013:
There is an EliasFanoEncoder2 with a constructor that takes only an upperBound. This class delegates to EliasFanoEncoder and has very much the same methods, but not yet a common interface/superclass. It works by growing by a factor of 2 as necessary and reencoding completely. This reencoding could be optimized for the lower bits by using PackedInts to block decode/encode, but I have not taken the time to implement that.
The EliasFanoDocIdSet has an extra constructor that takes only an upperBound and uses EliasFanoEncoder2.
I ran some tests on this, it appears to work fine here. A possibly misplaced assert in TestEliasFanoDocIdSet is corrected to make the test pass.
CachingWrapperFilter uses the extra EliasFanoDocIdSet constructor when the number of doc ids in the set is not known, and it reverts to FBS when this EliasFanoDocIdSet is not sufficiently smaller than an FBS.
For the rest see my comments on the patch of 18 Oct.
I also did a little bit of performance testing: the memory usage of the upperBound-only constructor is higher, as expected. The advance/next performance is only slightly worse, again as expected. I could not measure build times; I expect them to just about double for the upperBound-only constructor.
Adrien Grand (@jpountz) (migrated from JIRA)
I also did a little bit of performance testing: the memory usage of the upperBound-only constructor is higher, as expected.
Instead of keeping the oversized encoder, maybe the encoder could be resized to its actual size when finished encoding? This is what the other (wah8/pfor) sets do.
and it reverts to FBS when this EliasFanoDocIdSet is not sufficiently smaller than an FBS
Maybe pfor/wah8 would still be useful here since they efficiently encode almost full sets?
Paul Elschot (migrated from JIRA)
Instead of keeping the oversized encoder, maybe the encoder could be resized to its actual size when finished encoding? This is what the other (wah8/pfor) sets do.
I suppose you mean recomputing the actually needed sizes for the long arrays of the encoder after all encoding is done, and then reallocating them? That would certainly be possible. I'll have a look at wah8/pfor for this.
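The reallocation could look roughly like this sketch; the class, field, and method names are illustrative assumptions, not the actual EliasFanoEncoder internals:

```java
import java.util.Arrays;

// Sketch of the freeze() idea: encode into an oversized long array, then
// shrink the array to the words actually used once encoding is finished.
class ShrinkableEncoder {
    private long[] words = new long[1024]; // oversized at construction
    private int wordsUsed = 0;

    void addWord(long word) {
        if (wordsUsed == words.length) {
            words = Arrays.copyOf(words, words.length * 2);
        }
        words[wordsUsed++] = word;
    }

    // Like the proposed freeze(): reallocate to the actual size so the
    // cached set does not keep the construction-time overallocation.
    void freeze() {
        words = Arrays.copyOf(words, wordsUsed);
    }

    int capacityInWords() {
        return words.length;
    }
}
```

After freeze() the retained memory matches the encoded data, at the one-time cost of a final array copy.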
Maybe pfor/wah8 would still be useful here since they efficiently encode almost full sets?
The problem is that none of the compressing sets is as good as FBS at advancing far in almost full sets.
I'm working on this now, another patch is slowly on its way.
Paul Elschot (migrated from JIRA)
About the patch of 21 Oct:
Adds EliasFanoEncoderUpperBound (instead of EliasFanoEncoder2) with only an upperBound argument to the constructor. Adds a freeze() method to EliasFanoEncoder to reallocate to actual size. Both are used in EliasFanoDocIdSet, which now also reverts to using an FBS as needed.
Adapted TestEliasFanoDocIdSet to randomly use EliasFanoDocIdSet half of the time with a numValues of -1, so it may use EliasFanoEncoderUpperBound instead of EliasFanoEncoder at first.
For the rest see my remarks about the patch of 20 Oct.
Paul Elschot (migrated from JIRA)
On the patch of 22 Oct: the previous patch contains a bug in the freeze() code; it allocates more than an FBS size to one of the encoding arrays. This should fix it.
Paul Elschot (migrated from JIRA)
I've done some more performance measurements of this EliasFanoDocIdSet that allows only an upperBound to its constructor. No surprising results, but I'd like to add an APL2 header to the benchmark program that was derived from the (Test)DocIdSetBenchmark at #6300 and #6165 and post it. Adrien, can I assume an APL2 license on the first DocIdSetBenchmark at #6165?
The patch here is getting a little overloaded, so shall I open a separate issue for the EliasFanoDocIdSet that allows only an upperBound to its constructor?
Paul Elschot (migrated from JIRA)
I opened a github pull request for the latest patch: https://github.com/apache/lucene-solr/pull/19
Paul Elschot (migrated from JIRA)
Since #7045 CachingWrapperFilter uses RoaringDocIdSet. Are there any advantages for EliasFanoDocIdSet left for use there?
Adrien Grand (@jpountz) (migrated from JIRA)
Although the elias-fano set is smaller in the sparse case, it's true that RoaringDocIdSet tends to be faster to build and to iterate on. So overall I think the RoaringDocIdSet has a better trade-off indeed.
ASF GitHub Bot (migrated from JIRA)
Github user PaulElschot closed the pull request at:
https://github.com/apache/lucene-solr/pull/19
Migrated from LUCENE-5293 by Paul Elschot, resolved Oct 21 2014 Attachments: LUCENE-5293.patch (versions: 5)