Payloads should be compressed [LUCENE-6764]

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.59k stars 1.01k forks source link

Payloads should be compressed [LUCENE-6764] #7822

Open asfimport opened 9 years ago

asfimport commented 9 years ago

I think we should at least try to do something simple, eg. deduplicate or apply simple LZ77 compression. For instance if you use enclosing html tags to give different weights to individual terms, there might be lots of repetitions as there are not that many unique html tags.

Migrated from LUCENE-6764 by Adrien Grand (@jpountz), updated Sep 02 2015

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I disagree. Payloads should be something small like a byte or two. I dont even think they should be variable length: its a trap that adds additional per position noise. We should not encourage putting the contents of moby dick per position nor should we suffer the complexity hassles.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

Alternatively the html structure could be indexed, see #6689. Then one can query the structure and add the weights to the query.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Payloads should be something small like a byte or two. I dont even think they should be variable length: its a trap that adds additional per position noise. We should not encourage putting the contents of moby dick per position nor should we suffer the complexity hassles.

Of course you want payloads to be small. My point was that there is likely a very finite set of unique payloads and so we could likely store these payloads on a couple of bits instead of one or two entire bytes.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

store these payloads on a couple of bits

An EliasFanoSequence can do just that and is indexable by position. The sequence is normally non decreasing, so for random (small) numbers one should encode their cumulative sums.

An Elias-Fano encoding uses at most 2 + ceil(log(upperBound/numValues)) bits per encoded number.

Here numValues is the number of positions with payloads, and upperBound is the sum of the payloads.

The EliasFanoBytes can be used as a single payload per document (as currently at LUCENE-5627), or maybe better as a docvalue.

For now this is only efficiently indexable by value (to implement advancing in a DocIdSet). Efficient indexing by position (index in the sequence) can be easily added. These indexes have one entry per 256 values, so their size overhead is quite small.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

An easier implementation would be to store the n bits per payload contiguously in a bytesref docvalue. When that docvalue becomes too big to be completely loaded per document, it can be made seekable.

A seekable docvalue (or payload) would also be better than the sometimes large payloads now used at #6689.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Actually I was thinking of something simpler... The current postings format encodes everything into blocks of 128 values, so we could do dictionnary encoding independently on each block.

asfimport commented 9 years ago

Paul Elschot (migrated from JIRA)

That should work, too. Plenty of implementation options though, and I have no idea how much time can be spent on compression during indexing.