Blosc / c-blosc

A blocking, shuffling and loss-less compression library that can be faster than `memcpy()`.
https://www.blosc.org

blosc use case #73

Closed OneArb closed 9 years ago

OneArb commented 9 years ago

I am checking whether I could use Blosc to compress strings roughly 1000 characters long.

As a test I am using the string "Methionylthreonylthreonylglutaminyla...", which is highly repetitive.

http://blog.jmay.us/2009/11/longest-english-word.html

I modified simple.c, and the best I can get at clevel 9 is 1.5x compression with shuffle and 2.8x without shuffle.

Without shuffle:

| chars | ratio |
|------:|------:|
| 1000  | 1.4x  |
| 2000  | 1.8x  |
| 3000  | 2.0x  |
| 4000  | 2.1x  |
| 5000  | 2.3x  |

For comparison, ZIP compresses the full string at 5.5x.

Here are my settings:

```c
#define LINESIZE 98310
#define SIZE 100000
#define SHAPE {10,10,10}
#define CHUNKSHAPE {1,10,10}

static unsigned char data[LINESIZE];
static unsigned char data_out[SIZE];
static unsigned char data_dest[LINESIZE];
```
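For context, a minimal sketch of the round-trip test I have in mind (assuming the standard c-blosc 1.x `blosc_compress()` / `blosc_decompress()` calls, loosely following examples/simple.c; the test word is truncated in this listing):

```c
#include <stdio.h>
#include <string.h>
#include "blosc.h"

#define LINESIZE 98310
#define SIZE 100000

static unsigned char data[LINESIZE];
static unsigned char data_out[SIZE];
static unsigned char data_dest[LINESIZE];

int main(void) {
  /* Fill the buffer by repeating the test word (truncated here). */
  const char *word = "Methionylthreonylthreonylglutaminyl";
  size_t wlen = strlen(word);
  for (size_t i = 0; i < LINESIZE; i++) {
    data[i] = (unsigned char)word[i % wlen];
  }

  blosc_init();

  /* clevel 9, shuffle off (0) or on (1), typesize 1 for plain chars. */
  int csize = blosc_compress(9, 0, 1, LINESIZE, data, data_out, SIZE);
  if (csize <= 0) {
    printf("compression failed or data not compressible\n");
    blosc_destroy();
    return 1;
  }
  printf("%d -> %d bytes (%.1fx)\n", LINESIZE, csize,
         (double)LINESIZE / csize);

  int dsize = blosc_decompress(data_out, data_dest, LINESIZE);
  printf("round-trip %s (%d bytes)\n",
         (dsize == LINESIZE && memcmp(data, data_dest, LINESIZE) == 0)
             ? "OK" : "FAILED",
         dsize);

  blosc_destroy();
  return 0;
}
```

Flipping the second argument of `blosc_compress()` between 0 and 1 toggles the shuffle filter.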

Questions:

1. Am I within the expected compression ratios without switching to Zlib?
2. Is the block/string I intend to compress too small for Blosc's use case?
3. Is there any prospect of Blosc supporting indexed, random access to compressed blocks?
4. Any suggestions for performant "small" string compression?

OneArb commented 9 years ago

Closing; further research answered most of my questions.

FrancescAlted commented 9 years ago

Yes, the default compressor in Blosc (BloscLZ) is geared towards speed, not compression ratio, but maybe the included LZ4HC or ZLIB can get better ratios, especially when using large blocksizes. Does this match your research, or did you find something different?
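For instance, a minimal sketch of switching codecs at run time with the c-blosc 1.x API (codec names as the library registers them; this assumes the codec was compiled into the build):

```c
#include <stdio.h>
#include "blosc.h"

/* Try a higher-ratio codec than the default BloscLZ.
 * blosc_set_compressor() returns a non-negative code when the codec
 * is available in this build of Blosc, or -1 otherwise. */
static int try_codec(const char *name, const void *src, size_t nbytes,
                     void *dest, size_t destsize) {
  if (blosc_set_compressor(name) < 0) {
    printf("%-8s: not available in this build\n", name);
    return -1;
  }
  /* clevel 9, no shuffle, typesize 1 for plain character data. */
  int csize = blosc_compress(9, 0, 1, nbytes, src, dest, destsize);
  printf("%-8s: %zu -> %d bytes\n", name, nbytes, csize);
  return csize;
}

/* Usage (after blosc_init() and filling `data`):
 *   try_codec("blosclz", data, LINESIZE, data_out, SIZE);
 *   try_codec("lz4hc",   data, LINESIZE, data_out, SIZE);
 *   try_codec("zlib",    data, LINESIZE, data_out, SIZE);
 */
```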


OneArb commented 9 years ago

1) I found a few compression overviews:

http://compressionratings.com/sort.cgi?rating_sum.brief+6n

https://docs.google.com/spreadsheet/ccc?key=0AiLIAFlgldSodENkNEhIM3lDZEtBTlFUQ29FdWhvTEE&usp=sharing#gid=2

http://heartofcomp.altervista.org/MOC/MOCACE.htm

Would it be worth submitting Blosc and getting it into the fray?

Looking over the benchmark section, I notice that BloscLZ is the only decompressor able to outperform memcpy, at least on your machine.

The [blosc zlib benchmark](http://www.blosc.org/benchmarks-zlib.html) uses a different compression-ratio scale than the other compressors. It also starts at 0 (vs. 1), which interferes with the graph's readability.

A chart that spans compressors would ease comparison.

I sure would like to see BloscLZ take its due place within the compressor benchmark community.

2) simple.c uses almost all of the CPU bandwidth on my 2-core machine. Is that expected?

esc commented 9 years ago

Regarding the zlib benchmarks: the first measurement is also at one, but because zlib reaches such high compression ratios, especially with that dataset, it looks as if the measurement is at zero. Ideally we should start all graphs at one, since that means "no compression".

Regarding the speed of BloscLZ, I believe what you are seeing is a distortion due to measurement. The only benchmarks we have listed for LZ4 right now are from a BlueGene. This is an HPC architecture, and let's just say things behave differently there than on commodity hardware. I believe that both LZ4 and BloscLZ (maybe Snappy too) can outperform memcpy when driven by Blosc. The reason we don't have any LZ4 benchmarks listed yet is that driving LZ4 from Blosc has only been officially supported for about a year; support for BloscLZ is much older, so many more benchmarks have accumulated for it.

esc commented 9 years ago

FYI: the reason we get these "off-the-charts" ratios for zlib is the shuffle filter in Blosc, which can pre-condition certain datasets favorably for zlib, effectively boosting the compression ratio.

See also: http://slides.zetatech.org/haenel-ep14-compress-me-stupid.pdf page 23 onwards
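As a rough illustration (a toy sketch, not the benchmark from the slides): with a typesize larger than one byte, the shuffle filter groups bytes of equal significance together, so slowly varying 32-bit integers turn into long constant runs that zlib compresses far better than the raw interleaved bytes. This assumes a Blosc build with zlib support.

```c
#include <stdint.h>
#include <stdio.h>
#include "blosc.h"

#define N 100000  /* number of 32-bit integers */

static int32_t values[N];
static unsigned char packed[N * sizeof(int32_t) + BLOSC_MAX_OVERHEAD];

int main(void) {
  /* Slowly varying integers: the high bytes of consecutive values are
   * nearly constant, so grouping them with shuffle creates long runs. */
  for (int i = 0; i < N; i++) {
    values[i] = i / 10;
  }

  blosc_init();
  blosc_set_compressor("zlib");  /* assumes zlib was compiled in */

  int plain = blosc_compress(9, BLOSC_NOSHUFFLE, sizeof(int32_t),
                             sizeof(values), values, packed, sizeof(packed));
  int shuffled = blosc_compress(9, BLOSC_SHUFFLE, sizeof(int32_t),
                                sizeof(values), values, packed, sizeof(packed));
  printf("zlib, no shuffle: %d bytes\n", plain);
  printf("zlib, shuffle:    %d bytes\n", shuffled);

  blosc_destroy();
  return 0;
}
```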

OneArb commented 9 years ago

https://www.youtube.com/watch?v=IzqlWUTndTo at 9:39 provides the comparative chart I was looking for. LZ4 does indeed seem a bit faster overall across the range, linear, and random distributions.

At 11:19 there are charts of each compressor vs. memcpy for each distribution type.

I see Intel Core i5 tests for each supported compressor on http://blosc.org/synthetic-benchmarks.html; perhaps the benchmark distortion has some other source?