g1mv / density

Superfast compression library
Apache License 2.0
1.02k stars 48 forks source link

Fallback to 1-bit tags for selected blocks #11

Closed tarsa closed 9 years ago

tarsa commented 10 years ago

There are two algorithms in SHARC:

The issue is that not every time the second one compresses better and that is due to the fact that overhead the tags are causing is not always offseted by reduction of other parts of compressed stream.

An example is on Silesia corpus: http://mattmahoney.net/dc/silesia.html - file 'sao'

My proposal is to do fallback to 1-bit tags when we detect that 2-bit tags cause too much overhead causing compression ratio deterioration.

How to detect if 2-bit tags do too much overhead? One solution would be to compress each block twice - once with 1-bit tags and once with 2-bit tags, but that would be detrimental to compression speed. So a better solution would be to prepare a histogram of tags. Since in 2-bit tags mode there are only 4 different tag values it shouldn't bring too much speed overhead to compute the histogram. Once computed you can do simulation of what would happen if you do compression with 1-bit tags. Since it seems it couldn't be done exactly and super quickly, an probabilistic method should be used. You can eg assume that:

How to do switching between 1-bit tags mode and 2-bit tags mode? I think both modes should share the dictionaries. For example the 1-bit tags mode could use the first slots from dictionary and also update (but don't use) the prediction array. Or use the second slots only and don't touch the prediction array. Or use only first slots from dictionary for doing matches but update both slots as in 2-bit tags mode. Or some other combination. It should be manually tested which solution is the best. Another thing is that for fallback we would need to do some sort of backup of dictionaries on every block boundary. I think that with big enough blocks and use of some special instruction that don't pollute caches it could be pretty fast. Or do something different - don't do backups unless we detected that the last block could've been compressed better in 1-bit tags mode and stop doing backups when in two blocks in a row the 1-bit tags mode doesn't bring improvement.

g1mv commented 10 years ago

Hello Piotr thanks for this. Two things need verification :

The loss on the sao file is only 2%, so kernel switching should have a near-zero performance hit for it to be considered a benefit, as 95+% of files tested are better compressed with 2-bit mode compared to 1-bit.

tarsa commented 10 years ago

Hi,

I've just realized that the potential gains of fallback would be rather small and present only on hardly compressible blocks. Therefore there's no sense of testing for fallback on well compressible blocks. It would make sense then the compression ratio is worse than about 80% - 90% (ie as percentage of original size) and the compressed stream without signatures is less than about 95% of original size (because otherwise the block is incompressible anyway).

But even if it would work, then 2% - 3% improvement in compression ratio on hardly compressible files wouldn't offset even the penalty of having to recompress the block. Maybe I was just too quick to try to improve things and forgot the emphasis on speed :)

Thanks for attention and have a nice day!