apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Re-evaluate different ways to encode postings [LUCENE-10672] #11707

Open asfimport opened 2 years ago

asfimport commented 2 years ago

In Lucene 4, we moved to FOR (frame-of-reference) to encode postings because it would give better throughput compared to the VInts we had been using until then. This was a time when Lucene would often need to evaluate entire postings lists, and optimizations like BS1 were very important for good performance.

Nowadays, Lucene performs more dynamic pruning, and it is less common for Lucene to need to evaluate every hit that matches a query. So the performance of nextDoc() has become a bit less relevant, while the performance of advance(target) has become more relevant.

I wonder if we should re-evaluate other ways to encode postings that are theoretically better at skipping, such as Elias-Fano coding, which supports skipping directly on the encoded representation instead of requiring a full block of integers to be decoded when only a couple of them are relevant.
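
To make the "skipping on the encoded representation" point concrete, here is a toy Elias-Fano sketch, purely illustrative and not Lucene's postings code: the low bits of each docID are packed contiguously, the high bits are unary-coded into a bitmap, and advance(target) walks that upper-bits bitmap directly rather than decoding a whole block first. A real implementation would add a select/skip structure on the upper bits to jump in near-constant time instead of scanning, and would use Lucene's NO_MORE_DOCS convention instead of -1.

```java
/**
 * Toy Elias-Fano docID iterator (sketch only). Assumes a non-empty,
 * strictly increasing docID list with all values below {@code universe}.
 */
final class EliasFanoDocs {
  private final int numLowBits;  // l = floor(log2(universe / count))
  private final long[] lowBits;  // count values of l bits each, packed
  private final long[] highBits; // the i-th doc sets bit ((doc >>> l) + i)
  private long curPos = -1;      // bit position of the current doc in highBits
  private int curIndex = -1;     // ordinal of the current doc

  EliasFanoDocs(int[] sortedDocs, int universe) {
    int count = sortedDocs.length;
    numLowBits = 63 - Long.numberOfLeadingZeros(Math.max(1L, (long) universe / count));
    lowBits = new long[(int) (((long) count * numLowBits) >>> 6) + 2];
    highBits = new long[(int) ((((long) universe >>> numLowBits) + count) >>> 6) + 2];
    for (int i = 0; i < count; i++) {
      setLow(i, sortedDocs[i] & ((1L << numLowBits) - 1));
      long pos = ((long) sortedDocs[i] >>> numLowBits) + i; // unary coding
      highBits[(int) (pos >>> 6)] |= 1L << (pos & 63);
    }
  }

  private void setLow(int i, long value) {
    if (numLowBits == 0) return;
    long bit = (long) i * numLowBits;
    int word = (int) (bit >>> 6), shift = (int) (bit & 63);
    lowBits[word] |= value << shift;
    if (shift + numLowBits > 64) {
      lowBits[word + 1] |= value >>> (64 - shift); // value spills into next word
    }
  }

  private long getLow(int i) {
    if (numLowBits == 0) return 0;
    long bit = (long) i * numLowBits;
    int word = (int) (bit >>> 6), shift = (int) (bit & 63);
    long value = lowBits[word] >>> shift;
    if (shift + numLowBits > 64) {
      value |= lowBits[word + 1] << (64 - shift);
    }
    return value & ((1L << numLowBits) - 1);
  }

  /** Returns the first doc >= target after the current position, or -1. */
  int advance(int target) {
    long maxPos = (long) highBits.length << 6;
    for (long pos = curPos + 1; pos < maxPos; pos++) {
      if ((highBits[(int) (pos >>> 6)] & (1L << (pos & 63))) == 0) {
        continue; // a zero bit: one more high bucket passed, keep scanning
      }
      curIndex++;
      curPos = pos;
      // high part is (pos - curIndex): curIndex one-bits precede this bucket
      long doc = ((pos - curIndex) << numLowBits) | getLow(curIndex);
      if (doc >= target) {
        return (int) doc;
      }
    }
    return -1; // exhausted
  }
}
```

For example, with docs {3, 7, 120, 4000} and universe 5000, the encoder picks l = 10, and advance(100) returns 120 after touching only three bits of the upper bitmap plus one low-bits read; no block of 128 integers ever gets decoded.
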


Migrated from LUCENE-10672 by Adrien Grand (@jpountz)

asfimport commented 2 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1

Maybe we can also peek at how Vespa and Tantivy do their encoding ... any inspiration there?

The world needs more popular open-source search engines.

Tjianke commented 1 year ago

The Lucene community has a good tradition of incorporating academic results. Recent studies describe many efficient algorithms, such as Partitioned Elias-Fano. Here's a great paper comparing the space-time trade-offs of various index compression techniques. I'm interested in evaluating the current codec and implementing state-of-the-art compression schemes if viable. But I'm new to Lucene: which aspects (index size / query speed / memory / CPU) do we need to evaluate? Is there a mature benchmark to use? I noticed a LuceneBenchmark repo, though. @mikemccand @jpountz
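
For context, the core trick in Partitioned Elias-Fano (Ottaviano & Venturini) is that the postings list is cut into blocks, and each block independently picks whichever representation is cheapest for its local density. A minimal sketch of that per-block choice, with illustrative names that are not Lucene APIs:

```java
/** Hypothetical sketch of the per-block encoding decision in Partitioned Elias-Fano. */
class PartitionedEliasFanoSketch {

  enum BlockEncoding { ALL_ONES, ELIAS_FANO, BITMAP }

  /**
   * @param blockSize number of docIDs in the block
   * @param span      number of consecutive docIDs the block covers (max - min + 1)
   */
  static BlockEncoding cheapest(int blockSize, int span) {
    // A fully dense run needs no payload bits beyond the block header.
    if (span == blockSize) {
      return BlockEncoding.ALL_ONES;
    }
    // Elias-Fano costs roughly 2 + floor(log2(span / blockSize)) bits per doc.
    long lowBits = Math.max(0, 63 - Long.numberOfLeadingZeros((long) span / blockSize));
    long efBits = blockSize * (2 + lowBits);
    // A plain bitmap costs one bit per docID in the covered range.
    long bitmapBits = span;
    return efBits <= bitmapBits ? BlockEncoding.ELIAS_FANO : BlockEncoding.BITMAP;
  }
}
```

Dense blocks degrade to bitmaps or all-ones runs, sparse blocks keep Elias-Fano, which is where the space savings over plain Elias-Fano come from.
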

gsmiller commented 1 year ago

@Tjianke the luceneutil benchmarks are a great place to start; they power the nightly benchmarks that run continuously. For a change to such a core part of Lucene, the existing benchmark tasks will likely capture performance changes well (I would think). They're primarily focused on throughput/query cost, but they can build separate indexes for your baseline and candidate branches, and you can manually diff the index sizes.

For something like this, running microbenchmarks may also be useful. We did this the last time we made a change to the postings format. There's some history in #10889. I can try to jog my memory on that issue and answer questions if you have any :)
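
As a sketch of what such a microbenchmark could look like (hypothetical and self-contained, not the actual benchmarks from #10889; a plain sorted int[] with binary search stands in for an encoded postings list), contrasting exhaustive iteration with the sparse-skipping pattern that dynamic pruning produces:

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class PostingsAccessPatternBench {
  int[] docs;    // sorted docIDs simulating a mid-frequency term
  int[] targets; // sparse advance() targets, as dynamic pruning would issue

  @Setup
  public void setup() {
    Random r = new Random(42);
    docs = new int[1 << 20];
    for (int i = 0, doc = 0; i < docs.length; i++) {
      doc += 1 + r.nextInt(64); // random gaps, average ~32
      docs[i] = doc;
    }
    targets = new int[docs.length / 128];
    for (int i = 0; i < targets.length; i++) {
      targets[i] = docs[i * 128] + 1; // roughly one skip per 128 postings
    }
  }

  @Benchmark
  public long nextDocAll() { // exhaustive iteration: every posting is touched
    long sum = 0;
    for (int d : docs) sum += d;
    return sum;
  }

  @Benchmark
  public long advanceSparse() { // skipping: only ~1/128 of postings are wanted
    long sum = 0;
    for (int t : targets) {
      int idx = Arrays.binarySearch(docs, t);
      sum += docs[idx < 0 ? -idx - 1 : idx]; // first doc >= target
    }
    return sum;
  }
}
```

A real comparison would swap the int[] for candidate codecs (FOR block decode vs. Elias-Fano-style skipping) and measure both patterns per codec, since the interesting question is exactly how much each encoding pays on one pattern to win on the other.
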

Excited to see what comes of this!