PSeitz / lz4_flex

Fastest pure Rust implementation of LZ4 compression/decompression.
MIT License
425 stars 29 forks source link

lz4hc compression support #21

Open roblabla opened 2 years ago

roblabla commented 2 years ago

Hello, currently lz4_flex currently only supports the "standard" LZ4 compression, which means it can only compress at level 0, 1 and 2 (in liblz4, those levels are all the same). To support higher compression levels, LZ4 also has a mode called High Compression (HC), but it is unfortunately unimplemented in lz4_flex.

I was wondering if lz4_flex has any plans to implement HC mode in its compressor?

PSeitz commented 2 years ago

Hi,

generally yes, I would really like to have it in lz4_flex. I did some experiments some time ago, but they are outdated now.

I think conceptually it should be quite simple, instead of having one entry in the hashmap to search for duplicates back in the stream we would have multiple buckets for one hash and they would all get checked to find the best one. There may be other strategies, but I think this is the main one, for better compression in lz4.

Multiple Positions per Hash in the Hashmap

e.g. imagine this byte sequence

[1, 2, 3, 4, 98, 1, 2, 3, 4, 97, 1, 2, 3, 4, 98]

While walking over the data, we put hash of every 4 bytes and its position into the hashmap.

1, 2, 3, 4 -> (hash 999, position 0) 2, 3, 4, 96 -> (hash 555, position 1)

... when we arrive a the second entry, currently it would overwrite the old entry 1, 2, 3, 4 -> (hash 999, position 5)

If we would keep multiple entries for the hash, the last sequence 1, 2, 3, 4, 98, could compare both entries and see that the first is a longer match.

Currently the hashmap is simply a Vec, and the hash is right shifted to fit in the Vec bounds. It should be possible to double (or quadruple) the Vec and then have multiple posititons for one hash.