A while back, Lucene changed how it encodes doc IDs from PFOR-delta to FOR-delta, which is a bit faster but less space-efficient. To avoid space-efficiency regressions (especially on dense postings lists, which are common in logging datasets), @iverase moved Elasticsearch to a copy of the Lucene postings format that still uses PFOR-delta for compression (#103601).
But Lucene 9.12 introduced a new postings format that has better skipping logic (in general). It would be nice to take advantage of it. I would suggest the following plan:
- Use `Lucene912PostingsFormat` on indexes whose storage efficiency is not critical (heuristic to be defined, e.g. when `index.codec` is `default` and `source.mode` is not `synthetic`?).
- Create a new postings format that is a copy of `Lucene912PostingsFormat` but with a more space-efficient encoding of doc deltas. @dnhatn and I played with this earlier this year: there is room for significant improvement by storing exceptions (the P in PFOR stands for "patched") more efficiently and by allowing more exceptions per block.
- Move indexes whose storage efficiency is important to this new postings format instead of `ES812PostingsFormat`.
- Disallow using `ES812PostingsFormat` on new indexes.
- Move the write logic of `ES812PostingsFormat` to the test folder.
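
To make the FOR vs. PFOR trade-off behind this plan concrete, here is a rough cost model for a single block of doc deltas. It is only a sketch under stated assumptions and not the Lucene or Elasticsearch implementation: the class `PforSketch`, the block size of 128, the cap of 7 exceptions, and the flat 40 bits per exception (one position byte plus a full 32-bit value) are all hypothetical choices made for illustration.

```java
import java.util.Arrays;

/** Hypothetical cost model comparing FOR and PFOR on one block of doc deltas. */
public class PforSketch {
    // Assumed block size, for illustration only.
    static final int BLOCK_SIZE = 128;

    /** Number of bits needed to represent v (0 when v == 0). */
    static int bitsRequired(int v) {
        return 32 - Integer.numberOfLeadingZeros(v);
    }

    /** FOR: every delta is packed using the width of the largest delta. */
    static int forBits(int[] deltas) {
        int max = 0;
        for (int d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max) * deltas.length;
    }

    /**
     * PFOR: the e largest deltas become exceptions stored out of band, so the
     * packed width only has to cover the remaining values. Tries every e up to
     * maxExceptions and returns the cheapest total size in bits.
     */
    static int pforBits(int[] deltas, int maxExceptions, int bitsPerException) {
        int[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int best = forBits(deltas); // e == 0 is plain FOR
        int limit = Math.min(maxExceptions, sorted.length - 1);
        for (int e = 1; e <= limit; e++) {
            // Largest value that still has to fit in the packed part.
            int width = bitsRequired(sorted[sorted.length - 1 - e]);
            int cost = width * deltas.length + e * bitsPerException;
            best = Math.min(best, cost);
        }
        return best;
    }

    public static void main(String[] args) {
        // A dense postings block: consecutive docs, with one large gap.
        int[] deltas = new int[BLOCK_SIZE];
        Arrays.fill(deltas, 1);
        deltas[17] = 1000;
        System.out.println("FOR bits:  " + forBits(deltas));
        System.out.println("PFOR bits: " + pforBits(deltas, 7, 40));
    }
}
```

Under these assumptions the single outlier forces FOR to spend 10 bits on every delta (1280 bits for the block), while the patched variant packs the block at 1 bit per delta plus one exception (168 bits). Raising `maxExceptions` or lowering `bitsPerException` in this model is a crude stand-in for the "more exceptions per block" and "store exceptions more efficiently" ideas in the plan.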