EliasFanoDocIdSet [LUCENE-5084]

asfimport commented 11 years ago

DocIdSet in Elias-Fano encoding

Migrated from LUCENE-5084 by Paul Elschot, 2 votes, resolved Jul 09 2013 Attachments: LUCENE-5084.patch (versions: 2) Linked issues:

6116

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

For the patch of 30 June 2013:

This consists of the class EliasFanoDocIdSet and the two classes that it uses, EliasFanoEncoder and EliasFanoDecoder. The last two are implemented on long, EliasFanoDocIdSet does the casting to and from int for DocIdSet.

There are various ways in which the decoding speed could still be improved: Addition of an index on the high bits, see the Vigna paper. Use of broadword bit searching, actually better done at another issue, this uses Long method bitCount and numberOfTrailingZeros. I have not yet profiled for performance bottlenecks.

The decoder is not really complete, it has an advanceToIndex method but no backToIndex method yet.

Nevertheless this is usable now because the compression works and the linear searches that are done (because of the lack on indexing) will access no more than roughly 3N bits, where N is the number of doc ids in the set, and FixedBitSet.nextDoc() can (theoretically) access a number of bits equal to the number of docs in a segment.

TestEliasFanoSequence tests EliasFanoEncoder and EliasFanoDecoder.

TestEliasFanoDocIdSet tests EliasFanoDocIdSet.

I have used package o.a.l.util.eliasfano, this could be changed to o.a.l.util.packed for example.

There is a NOCOMMIT for a static longHex method that dumps a long in fixed width hex format, is there a better place for this method?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I have not dug much through the code but I tested it against various randomly-generated sets with numDocs=10M, and the compression looks great:

Load	FixedBitSet	WAH8DocIdSet(#6145)	EliasFanoDocIdSet(this issue)	PForDeltaDocIdSet(from kamikaze, LUCENE-2750)
0.001%	1.2 MB	424 bytes	344 bytes	9 KB
0.01%	1.2 MB	3.4 KB	2 KB	10.6 KB
0.1%	1.2 MB	28.4 KB	14.7 KB	25.1 KB
1%	1.2 MB	223.2 KB	104.6 KB	132.3 KB
10%	1.2 MB	1 MB	641 KB	860.5 KB
30%	1.2 MB	1.2 MB	1.3 MB	1.9 MB
50%	1.2 MB	1.2 MB	1.8 MB	2.7 MB
70%	1.2 MB	1.2 MB	2 MB	3 MB
90%	1.2 MB	1.2 MB	2.3 MB	3.1 MB

I especially like the fact that it saves almost half the memory even for pretty large sets that contain 1/10th of all doc IDs.

I have used package o.a.l.util.eliasfano, this could be changed to o.a.l.util.packed for example.

Indeed maybe we don't need a dedicated package for this DocIdSet. oal.util.packed would be fine I think.

There is a NOCOMMIT for a static longHex method that dumps a long in fixed width hex format, is there a better place for this method?

I think it is OK to leave it here.

I'll try to dig more thoroughly into the patch in the next few days...

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Coo bitset :-)

What's up with the kamikaze bitset, also an option?

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

That was fast :) The results posted above seem to be in line with the formula below, and that is quite nice to see.

The upper bound for the size in bits per encoded number in an EliasFanoDocIdSet is:

2 + ceil(2log(upperBound/numValues))

and a few constant size objects also have to be added in there.

Please note that there is no index yet. The index will be relatively small, for example for the 10% case above with 641 kB I expect an index size of about 12 kB, adding about 2% to the size. This index will consist of N/256 entries of a single number with max value 3N, i.e. ceil(2log(3N)) bits per index entry.

The code posted here is still young. So even though it has some test cases, I'd like be reassured that in the code that produced the posted results, there is at least a basic test that verifies that all input docs are available after compression. Is that the case?

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Uwe,

The EliasFanoDocIdSet is not a bitset, but it is definitely cool, please have a look at the Vigna article I mentioned at #6145.

The upper bound above on the bits per encoded number is valid for every sorted permutation of non negative whole numbers bounded by upperBound, and this size bound is no more than half a bit larger than the smallest possible representation.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Please note that there is no index yet. The index will be relatively small, for example for the 10% case above with 641 kB I expect an index size of about 12 kB, adding about 2% to the size. This index will consist of N/256 entries of a single number with max value 3N, i.e. ceil(2log(3N)) bits per index entry.

This sounds good!

The code posted here is still young. So even though it has some test cases, I'd like be reassured that in the code that produced the posted results, there is at least a basic test that verifies that all input docs are available after compression. Is that the case?

Yes, I checked that all doc ID sets have the same content.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

What's up with the kamikaze bitset, also an option?

I'll take some time to evaluate all the options we have for compressed doc ID sets, and the kamikaze implementation is one of them.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

The number of index entries may actually be 2N/256 instead of N/256. That would be about 4% size in the case above.

I am about to implement such an index, but my tempo is going to be slow. So in case anyone can do that faster...

asfimport commented 11 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Sorry if this sounds uninformed, but what is the CPU tradeoff for all these compressions and how does query speed compare across all of them? What is the logic behind never looking at that when making memory comparisons? I skimmed the paper Paul references and saw the main comparisons were really more about query speed and less about memory savings. Thanks.

asfimport commented 11 years ago

Emmanuel Espina (migrated from JIRA)

Have you considered creating a PostingFormat with this? I was thinking in something like DirectPostingsFormat but instead of using an array of ints for storing the docIds using an Elias-Fano compressed bit stream. Facebook is using an in-memory index for its graph search and they are compressing the posting lists with Elias Fano (https://www.facebook.com/download/138915572976390/UnicornVLDB-final.pdf)

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think #6116 will explore that (using a bitset to represent DOCS_ONLY postings)...

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Have you considered creating a PostingFormat with this? I was thinking in something like DirectPostingsFormat but instead of using an array of ints for storing the docIds using an Elias-Fano compressed bit stream.

The Vigna paper is all about posting formats. Because of this I first implemented an encoder and a decoder in a long format, and then used these here for a DocIdSet that works on int.

For a postings format, the encoder would need an additional constructor from index data. That might involve merging the currently separate long arrays for high bits and low bits into a single array.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

... what is the CPU tradeoff for all these compressions and how does query speed compare across all of them?

We don't really know. One conclusion from the Vigna paper is that a tuned implementation of an Elias-Fano decoder is faster than a tuned PForDelta implementation for highly selective phrase queries. I would guess that that is because Elias-Fano uses random access to the low bits where PForDelta only uses bulk decompression of the low bits, which compensates for Elias-Fano decoding its high bits somewhat slower than PForDelta can decode its exceptions.

One reason to use Elias-Fano for a DocIdSet here is that its high bits are encoded in unary coding which can easily be decoded in two directions, and that makes it useful for block joins. The other reason is that its compression is quite good, which makes it a nice candidate for in memory filters.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I had a deeper look at the patch, here are some thoughts about it:

the patch looks great, really, I especially like all the integrity checks!
after having read more of it, I think moving it to oal.util.packed makes perfect sense
the documentation mentions the fact that a FixedBitSet will be more storage efficient if numValues >= (upperBound/3), maybe we should have a static utility method to check that so that consumers of this API can opt for a FixedBitSet if their doc set is going to be dense?
in EliasFanoEncoder constructor, the ceil of the log in base 2 is computed through a loop, can we use Integer.numberOfLeadingZeros instead (as explained in http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#numberOfLeadingZeros(int) )
I think it would make sense to use PackedInts.getMutable to store the low-order bits instead of a raw long[] (this abstraction will hide the complexity of the bit packing or even use a more appropriate structure, eg. if the number of bits required is 8, it will use a byte[])
I'm not sure but shouldn't the iterator's getCost method return efDecoder.numValues instead of efEncoder.numValues?

Even though storing longs is more general, it forces us to use a different value to encode NO_MORE_DOCS which adds a little complexity to the DocIdSet implementation. Maybe we could just support the encoding of monotonically increasing sequences of ints to make things simpler? (just thinking out loud).

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

maybe we should have a static utility method to check that so that consumers of this API can opt for a FixedBitSet if their doc set is going to be dense?

We could, but in which class? For example, in CachingWrapperFilter it might be good to save memory, so it could be there. Also, would the expected size be the only thing to check for? When decoding speed is also important, other DocIdSets might be preferable.

the ceil of the log in base 2 is computed through a loop

numberOfLeadingZeros is indeed better than a loop. We need the Long variant here.

use PackedInts.getMutable to store the low-order bits instead of a raw long[]

Can PackedInts.getMutable also be used in a codec? Longs are needed for the high bits, see below, and the high and low bits can be conveniently stored next to each other in an index.

shouldn't the iterator's getCost method return efDecoder.numValues instead of efEncoder.numValues?

Yes.

Maybe we could just support the encoding of monotonically increasing sequences of ints to make things simpler?

I considered a decoder that returns ints but that would require a lot more casting in the decoder. Decoding the unary encoded high bits is best done on longs, so mixing longs and ints in the encoder is not really an option. We could pass the actual NO_MORE_VALUES to be used as an argument to the decoder, would that help?

As to why decoding the unary encoded high bits is best done on longs, see Algorithm 2 in "Broadword Implementation of Rank/Select Queries", Sebastiano Vigna, January 30, 2012, http://vigna.di.unimi.it/ftp/papers/Broadword.pdf . I also have an initial java implementation of that, but it is not used here yet, there are only a few comments in the code here that it might be used. I'll open another issue for broadword bit selection later.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

We could, but in which class? For example, in CachingWrapperFilter it might be good to save memory, so it could be there.

This new doc id set might be used for other use-cases in the future, so maybe we should have this method on the EliasFanoDocIdSet class?

Also, would the expected size be the only thing to check for? When decoding speed is also important, other DocIdSets might be preferable.

Sure, this is something we need to give users control on. For filter caches, it is already possible to override CachingWrapperFilter.docIdSetToCache to decide whether speed or memory usage is more important. The decision can even depend on the cardinality of the set to cache or on its implementation. So we just need to provide users with good defaults I think?

I haven't run performance benchmarks on this set implementation yet, but if it is faster than the DocIdSets iterators of our default postings format, then they are not going to be a bottleneck and I think it makes sense to use the implementation that saves the most memory. If they are slower or not faster enough, then maybe other implementations such as kamikaze's p-for-delta-based doc ID sets (#3824) would make more sense as a default.

Can PackedInts.getMutable also be used in a codec?

The PackedInts API can return readers that can read directly from an IndexInput if this is the question but if we want to be able to store high and low bits contiguously then they are not going to be a good fit.

I considered a decoder that returns ints but that would require a lot more casting in the decoder.

OK. I just wanted to have your opinion on this, we can keep everything as a long.

I'll open another issue for broadword bit selection later.

Sounds good! I think backwards iteration and efficient skipping should be done in separate issues as well, even without them this new doc ID set would be a very nice addition.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Ok, I'll post a new patch that

is in package oal.util.packed,
uses numberOfLeadingZeros instead of a loop,
has a correct getCost, and
has a static method sufficientlySmallerThanBitSet, deliberately imprecise...

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

As announced on 4 July 2013.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The patch looks ready. I think we should just add a bit more randomization to the tests before committing.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Would you have any specific purpose for randomized testing?

Randomized test cases with uniform data distributions are not likely to test exceptional situations in the high bits such as long high bit words with all zeros or all ones.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I tend to like testing different scenarii every time tests are run (and tests, especially lucene-core tests, are run very very often by the CI servers), this helped find many unsuspected bugs in the past. For example, the random variable can be used to compute slight variations from the exceptional situations which are interesting to test.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Randomized test cases with uniform data distributions are not likely to test exceptional situations in the high bits such as long high bit words with all zeros or all ones.

The exceptional situations you can test separately. I am constantly surprised at how many exceptional conditions in pretty much regular code one can overlook. Like Adrien said – it doesn't hurt to be in there and if it catches something, it's even better.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

At the moment I don't know how to do better random testing than the code that produced the results posted above, so for now I'll concentrate on adding an index.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

for now I'll concentrate on adding an index.

Have you started working on it? Otherwise I would like to commit your patch, I think it is already a very good improvement and we can work on the index on another issue, what do you think?

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Otherwise I would like to commit your patch, I think it is already a very good improvement and we can work on the index on another issue

+1 definitely. One step at a time. Thanks Paul & Adrien.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

For now the indexed version is in subclasses of the encoder and decoder. In these I only expect some changes in method signatures from private to protected, so I don't mind either way.

asfimport commented 11 years ago

Paul Elschot (migrated from JIRA)

Broadword bit selection is at #6162 .

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1501576 from @jpountz https://svn.apache.org/r1501576

LUCENE-5084: EliasFanoDocIdSet.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1501577 from @jpountz https://svn.apache.org/r1501577

LUCENE-5084: Move Elias-Fano encoding from Lucene 4.4 to 4.5...

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1501578 from @jpountz https://svn.apache.org/r1501578

LUCENE-5084: EliasFanoDocIdSet (merged from r1501576 and r1501577).

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Committed, thanks Paul!

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

4.5 release -> bulk close

apache / lucene

EliasFanoDocIdSet [LUCENE-5084] #6148

6116