elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Compression using a common LZW dictionary for the whole index or document types #3092

Closed acmeguy closed 11 years ago

acmeguy commented 11 years ago

Hello,

We are storing a massive amount of data which is quite redundant (documents do not vary greatly). For this reason I have been looking into the compression options of ES, and I see that documents can be compressed individually or together as part of a bulk.

The drawback of the bulk method is that the whole bulk would need to be decompressed if a single document in the bulk is needed (as I understand it).

An interesting option for us would be the ability to use a single, growing dictionary to compress all the _source documents. This would result in a bigger LZW dictionary, but it would also mean a better compression ratio and the ability to decompress single documents.
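
To make the idea concrete, here is a rough sketch of the kind of thing I mean, using the JDK's `Deflater`/`Inflater` preset-dictionary support (DEFLATE rather than LZW, and obviously not anything ES does today; the class name and sample documents are made up for illustration):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SharedDictionarySketch {
    public static void main(String[] args) throws Exception {
        // A "dictionary" built from text that is common to most documents.
        byte[] dict = "{\"type\":\"event\",\"source\":\"web\",\"status\":\"ok\"}".getBytes("UTF-8");
        byte[] doc  = "{\"type\":\"event\",\"source\":\"web\",\"status\":\"ok\",\"id\":42}".getBytes("UTF-8");

        // Compress one document on its own, but seeded with the shared dictionary.
        Deflater deflater = new Deflater();
        deflater.setDictionary(dict);
        deflater.setInput(doc);
        deflater.finish();
        byte[] compressed = new byte[256];
        int compressedLen = deflater.deflate(compressed);

        // Decompress just that one document; the dictionary is supplied on demand.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLen);
        byte[] out = new byte[256];
        int n = inflater.inflate(out);
        if (n == 0 && inflater.needsDictionary()) {
            inflater.setDictionary(dict);
            n = inflater.inflate(out);
        }
        System.out.println(new String(out, 0, n, "UTF-8"));
    }
}
```

The point is that each document is compressed and decompressed on its own, yet still benefits from the redundancy captured in the shared dictionary.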

Is this something which has been considered?

Very best regards, -Stefan Baxter

clintongormley commented 11 years ago

Hi Stefan

Single-document and bulk indexing are no different in this respect. Also, the default codec that we use in 0.90 to write new segments compresses stored fields and term vectors by default.

You don't need to enable anything for this to work - it's the default now.

acmeguy commented 11 years ago

Hi,

So ES is using a single common LZW dictionary for all _source documents already?

Regards, -Stefan

clintongormley commented 11 years ago

For all _source documents in a segment, yes. If you want a single dictionary for the whole shard, then optimize the shard down to one segment. Of course, if you keep indexing, doing that is pointless - new segments will be created anyway. Smaller segments get merged into bigger segments automatically, so you can just leave it up to ES to handle.
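
(For reference, the optimize API, e.g. `_optimize?max_num_segments=1`, boils down to a Lucene force merge. A minimal sketch of what that means at the Lucene level, assuming Lucene 4.x and a made-up index path; this is just to show the mechanism, not something you would run against a live shard:)

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ForceMergeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to a (closed) shard's Lucene index directory.
        Directory dir = FSDirectory.open(new File("/path/to/shard/index"));
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
        IndexWriter writer = new IndexWriter(dir, cfg);
        try {
            // Merge everything down to a single segment. Each segment has its own
            // independently compressed stored-fields file, so one segment means
            // one set of compression blocks for the whole shard.
            writer.forceMerge(1);
        } finally {
            writer.close();
        }
    }
}
```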

acmeguy commented 11 years ago

Great, thank you!

jpountz commented 11 years ago

I'll try to give a little more information on how stored fields compression works under the covers.

The stored fields file is compressed into blocks of 16K or more, and Lucene maintains a very compact in-memory data structure that maps every document ID to the start offset of the block that contains it (a document cannot span several blocks). All blocks are independent, meaning there is no shared dictionary, and are compressed using LZ4 (https://code.google.com/p/lz4/), an LZ77-based compression format that trades some compression ratio for lower CPU usage.
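
A toy sketch of that doc->block lookup (the names and layout here are made up; Lucene's real structure is more compact, but the binary-search idea is the same):

```java
import java.util.Arrays;

public class BlockIndexSketch {
    // firstDocInBlock[i] = docID of the first document stored in block i
    private final int[] firstDocInBlock;
    // blockStartOffset[i] = file offset where compressed block i starts
    private final long[] blockStartOffset;

    public BlockIndexSketch(int[] firstDocInBlock, long[] blockStartOffset) {
        this.firstDocInBlock = firstDocInBlock;
        this.blockStartOffset = blockStartOffset;
    }

    /** Returns the file offset of the block containing docId. */
    public long blockOffsetFor(int docId) {
        int idx = Arrays.binarySearch(firstDocInBlock, docId);
        if (idx < 0) {
            idx = -idx - 2; // insertion point - 1: the block that starts at or before docId
        }
        return blockStartOffset[idx];
    }

    public static void main(String[] args) {
        BlockIndexSketch index = new BlockIndexSketch(
            new int[]  {0, 7, 15},                // blocks start at docs 0, 7, 15
            new long[] {0L, 16_384L, 32_768L});   // and at these byte offsets
        System.out.println(index.blockOffsetFor(9)); // -> 16384 (the second block)
    }
}
```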

At reading time, there is an optimization that decompresses as little data as needed. For example, if you are looking for the 2nd document of a block that contains 7 documents, it is very likely that the block will only be partially decoded.
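
To illustrate the early exit, here is a sketch using the JDK's DEFLATE classes (Lucene actually uses LZ4 and its own block format; the documents and offsets below are made up):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PartialDecodeSketch {
    public static void main(String[] args) throws Exception {
        // Pretend this block holds 7 small documents back to back.
        byte[] block = "doc0|doc1|doc2|doc3|doc4|doc5|doc6|".getBytes("UTF-8");
        Deflater deflater = new Deflater();
        deflater.setInput(block);
        deflater.finish();
        byte[] compressed = new byte[256];
        int compressedLen = deflater.deflate(compressed);

        // We want the 2nd document; pretend per-document lengths tell us it
        // ends at uncompressed offset 10 within the block.
        int neededBytes = 10;

        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLen);
        byte[] out = new byte[neededBytes];
        int produced = 0;
        while (produced < neededBytes && !inflater.finished()) {
            produced += inflater.inflate(out, produced, neededBytes - produced);
        }
        // The rest of the block (docs 3..6) is never requested from the decoder.
        System.out.println(new String(out, 5, 5, "UTF-8")); // -> "doc1|"
    }
}
```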

There are a few optimizations that could improve the compression ratio, but it is likely that they would bring a few drawbacks too. For example, the doc->offset mapping that Lucene stores in memory is very compact because it only stores the address of the first document of each block (and uses a binary-search-like approach to find the block offset for a given doc ID). If documents were not stored in large blocks anymore, this would likely require more memory.

Additionally, the fact that there is no shared state across segments is great for merging, since in a few common cases Lucene can copy the compressed data directly without having to decompress and recompress it.

acmeguy commented 11 years ago

Thank you.

This also means that the real answer to my question is no :) (if I understand you correctly).

I know I'm nitpicking here (this is more a curiosity than a requirement), but the data I'm working with would really benefit from being compressed with a single, common compression dictionary.

I understand the logic behind the current approach and am in no position to state that a common dictionary has enough benefits to warrant an inquiry.

Very best regards, -Stefan

clintongormley commented 11 years ago

OK, so I fudged it a bit :)

acmeguy commented 11 years ago

NP.

Dealing effectively with highly redundant data should be quite valuable. I hope an opportunity will present itself for you guys to investigate it further :)

Regards, -Stefan