apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

EliasFano value index [LUCENE-5109] #6173

Closed asfimport closed 11 years ago

asfimport commented 11 years ago

Index upper bits of Elias-Fano sequence.


Migrated from LUCENE-5109 by Paul Elschot, 1 vote, resolved Oct 15 2013 Attachments: LUCENE-5109.patch (versions: 3) Linked issues:

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Mostly untested, not committable.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

I don't expect to have time to continue this until half way September, so I'd rather leave this here, maybe someone can pick it up.

The index is a value index on the upper bit numbers. I have called the index a value index, because it would also be possible to add an index the Elias-Fano index of the unary upper bit numbers. Basically a value index indexes the zero upper bits, and an index index would index the one upper bits. (A unary number is a series of zero bits followed by a single one bit, and the number of zero bits determines the value.)

The patch adds ...ValueIndexed.java files that extend EliasFanoEncoder and EliasFanoDecoder. (EliasFanoEncoder is also changed to use longHex from ToStringUtils, just as in #6162 .)

EliasFanoEncoderValueIndexed creates an index by value on the upper bits, and EliasFanoDecoderValueIndexed uses this index in its advanceToValue method.

Both EliasFanoEncoder EliasFanoDecoder have attributes changed from private to protected, so they can be used in their subclasses.

The EliasFanoDocIdSet is changed to use the above value indexed versions. Its testAgainstBitSet has been overriden to be empty (nocommit) because that test currently fails. There are no other tests yet.

The main function of the patch is overriding the method advanceToHighValue in EliasFanoDecoderValueIndexed. The idea is to advance to the high value just before the actual target from the index, and then continue as usual.

Unfortunately the value index does not work yet, the testAgainstBitSet fails. Some tests for the Elias-Fano index itself clearly need to be added first.

Once this index works it is probably better to merge it into EliasFanoEncoder and EliasFanoDecoder because of the speed up the index is expected to provide.

I prefer to start like this because without index things are working nicely now, even with the changes from private to protected.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

The patch of 19 September 2013 implements the EliasFano value index encoding/decoding, and adds some tests that pass.

The following is still to be done:

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for the update, this looks interesting!

use this to implement EliasFanoValueIndexedDocIdSet, test, maybe benchmark

This can be useful to test the overhead of the index compared to EliasFanoDocIdSet but given that we are probably going to want an index almost everytime, maybe we could just make EliasFanoDocIdSet use an index by default, potentially giving the ability to disable indexing by passing indexInterval=Integer.MAX_VALUE (like the other sets).

add broadword bit selection

I'm looking forward to it!

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

I did some simple benchmarking for the DocIdSet. With an index using 256 as index interval for the high values, the memory usage is pretty close to the pfordelta set.

Also the worst case performance of the non indexed version shown here http://people.apache.org/\~jpountz/doc_id_sets.html by the (green) EliasFanoDocIdSet now nicely flattens out for higher bit densities. For example for advance(319993) on my 32 bit machine, the log score compared to FixedBitSet for density 1/10 (-1 logarithmic), is -9.13 for non indexed, and -3.97 for indexed. So basically the indexing works as expected.

given that we are probably going to want an index almost everytime

The simplest thing that can work here is to add the index to the current EliasFanoEncoder and EliasFanoDecoder classes, so I'll do just that, and post a patch for that here.

The default index size with divisor 256 for the high values is small enough to not be bothered about it. Performance may improve by using 128 as a divisor, benchmarking is needed for that. The smaller the divisor, the fewer upper bits need to be visited during a far advance. Roughly, in case the value is present, the number of visited upper bits is 1.5 times the divisor. So for 256 (4 long words), 6 long words are expected to visited. For 128, that would be 3 long words. Vigna also mentions avoiding the index when the advance in the upper bits is less than the index divisor. I have not implemented that.

I also tried adding broadword bit selection. For that I'll open a new issue.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

The issue for broadword bit selection is #6300 .

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Patch of 23 september: as announced yesterday.

I tried benchmarking with index divisor 128 instead of 256. It is indeed a little bit faster for far advanceTo operations.

I used this code snippet in the benchmark to avoid the EliasFanoDocIdSet being used when it is not advisable:

    new DocIdSetFactory() {
      `@Override`
      public DocIdSet copyOf(FixedBitSet set) throws IOException {
        int numValues = set.cardinality();
        int upperBound = set.prevSetBit(set.length() - 1);
        if (EliasFanoDocIdSet.sufficientlySmallerThanBitSet(numValues, upperBound)) {
      final EliasFanoDocIdSet copy = new EliasFanoDocIdSet(numValues, upperBound));
      copy.encodeFromDisi(set.iterator());
      return copy;
    } else {
      return set;
    }
      }
    }

The sufficientlySmallerThanBitSet method currently checks for upperbound/7 > numValues. That used to be a division by 6, I added 1 because the index was added.

Anyway, "advisable" will depend on better benchmarking than I can do...

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

The patch at #6300 also contains the changes here.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Resolved through #6300