apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

RoaringDocIdSet [LUCENE-5983] #7045

Closed asfimport closed 10 years ago

asfimport commented 10 years ago

Robert pointed me to this paper: http://arxiv.org/pdf/1402.6407v4 that describes an interesting way to build doc id sets: The bit space is divided into blocks of 2^16 bits so that you can store the bits which are set either in a short[] (2 bytes per doc ID) or in a FixedBitSet. The choice is easy, if less than 2^12 bits are set, then the short[] representation is more compact otherwise a FixedBitSet would be more compact. It's quite similar to the way that Solr builds DocSets in SolrIndexSearcher.getDocSet(DocsEnumState).


Migrated from LUCENE-5983 by Adrien Grand (@jpountz), 1 vote, resolved Oct 06 2014 Attachments: LUCENE-5983.patch

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here are a patch and the results of the benchmark on this new DocIdSet: http://people.apache.org/\~jpountz/doc_id_sets4.html

Compared to the paper, I also optimized the super-dense case in order to encode the negation of the set to save space (like WAH8DocIdSet does). I like the fact that this set is overall very fast, especially to build, which is an interesting characteristic for caching, so I replaced WAH8DocIdSet with this new DocIdSet in CachingWrapperFilter. Another nice property is that it is naturally indexed, it doesn't need a side-car data-structure like the WAH8 and PFOR doc id sets: in the array case, you can just perform a binary search, and in the FixedBitSet case you already have random-access.

In order to avoid the profusion of doc id sets, I'm thinking of removing the WAH8 and PFOR doc id sets since this one looks better to me in most areas.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Cool. I also like that it's worst-case in advance() is not too shabby.

Why remove WAH8 & PFOR yet not also Elias-Fano? Because EF compresses the most and doesn't perform as bad as those two in most advance() scenarios?

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Why remove WAH8 & PFOR yet not also Elias-Fano? Because EF compresses the most and doesn't perform as bad as those two in most advance() scenarios?

I have to admit I don't know this set as well as the PFOR and WAH8 ones, but indeed it seems to compress very efficiently sparse sets and also has the nice property that WAH8 and PFOR miss that it is naturally indexed, ie. it doesn't need a side-car data-structure in order to be able to skip efficiently.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1629605 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1629605

LUCENE-5983: CachingWrapperFilter now uses RoaringDocIdSet.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1629606 from @jpountz in branch 'dev/branches/branch_5x' https://svn.apache.org/r1629606

LUCENE-5983: CachingWrapperFilter now uses RoaringDocIdSet.

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Committed to trunk and branch_5x. I also opened /LUCENE-5993 to discuss the removal of the WAH8 and PForDelta sets.

asfimport commented 9 years ago

Anshum Gupta (@anshumg) (migrated from JIRA)

Bulk close after 5.0 release.