Blosc / python-blosc

A Python wrapper for the extremely fast Blosc compression library
https://www.blosc.org/python-blosc/python-blosc.html

add docs for set_blocksize for release #76

Closed: ThomasWaldmann closed this 9 years ago

ThomasWaldmann commented 9 years ago

Hi, great stuff you have done in blosc, thanks! :)

I am trying it for https://attic-backup.org/ and, due to the way attic works, it usually processes chunks of roughly 64 KB. I first had a bit of trouble getting blosc to run in parallel until I found set_blocksize.

So I'd like to suggest adding a bit of documentation for it rather than just writing "experts only" - maxing out speed is the whole point of blosc, right? So it should not just use one thread.

I have currently set the blocksize to 8192, reasoning that a 64 KB chunk divided across maybe 8 cores gives 8 KB blocks. Is this correct?
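For reference, roughly what I'm doing right now (the thread count, the 8192 blocksize and the lz4/level settings are just my guesses and placeholders):

import blosc

blosc.set_nthreads(8)        # I have 8 cores
blosc.set_blocksize(8192)    # 64 KB chunk / 8 cores -> 8 KB blocks (my guess)

def compress_chunk(chunk):
    # chunk is one of attic's ~64 KB chunks, already a bytes object here
    return blosc.compress(chunk, typesize=1, cname='lz4', clevel=1)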

Also, I'd like to suggest doing a new release on PyPI - having "dev" packages as a dependency is a bit ugly.

ThomasWaldmann commented 9 years ago

Hmm, meanwhile I suspect there is some per-call overhead in compress and that it gets a bit inefficient for small block sizes. Is it due to thread startup overhead? Does it start and stop threads for each compress call?

FrancescAlted commented 9 years ago

No, there is a pool of threads, so this should not add too much overhead. The performance problem is a bit hairy to describe because caches are strange beasts. My experience playing with buffers is pretty much summarized in the compute_blocksize() function (https://github.com/Blosc/c-blosc/blob/master/blosc/blosc.c#L819); the takeaway is that 16 KB is the minimum per thread, so if you have, say, 4 cores, that would mean chunksizes of at least 64 KB.

Caches being complex creatures also means it is difficult to give recommendations to users beyond testing with different chunksizes and numbers of threads. Sorry about that.
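For what it's worth, a minimal sketch of the kind of experiment I mean (the payload, the sizes and the codec settings below are purely illustrative; use your real chunks for meaningful numbers):

import time
import blosc

payload = b"some fairly repetitive payload\n" * 100000   # illustrative data only

for nthreads in (1, 2, 4, 8):
    blosc.set_nthreads(nthreads)
    for chunksize in (16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024):
        chunk = payload[:chunksize]
        start = time.perf_counter()
        for _ in range(100):   # repeat to get a more stable timing
            compressed = blosc.compress(chunk, typesize=1, cname='lz4', clevel=5)
        elapsed = time.perf_counter() - start
        print(nthreads, chunksize, len(compressed), round(elapsed, 4))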

esc commented 9 years ago

In addition, it is worth mentioning that an LZ77 compressor works by looking at previously seen data. If the blocks are small, there are boundaries the compressor cannot traverse, which means many small blocks will likely compress worse overall than fewer larger ones. Also, there is a fixed overhead per block in the form of a header, so fewer blocks means fewer headers. Overall, choosing a good blocksize is an art, hence the 'experts only' documentation.
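A quick way to see the effect on the ratio for yourself would be something like this (the data and codec settings are only placeholders):

import blosc

data = b"the quick brown fox jumps over the lazy dog\n" * 20000   # any compressible data

for blocksize in (0, 4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024):   # 0 = automatic
    blosc.set_blocksize(blocksize)
    compressed = blosc.compress(data, typesize=1, cname='lz4', clevel=5)
    print(blocksize, len(data) / float(len(compressed)))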

Having said that, feel free to open a pull-request updating the docstring if you like.

Regarding the new release, it is being prepped already and should be out soon.

ThomasWaldmann commented 9 years ago

Thanks for working on the new release, and for the explanations.

Yes, I see that the small chunks increase overhead; I'll check whether it makes sense to increase the chunksize in attic. It would decrease overhead in other places too, but a larger chunksize might mean less deduplication, because larger chunks are less likely to be duplicates, so it's tricky...

By the way, I did some benchmarks, and on my test data lz4 was about as fast as no compression. A bit strange: the lz4 compression levels didn't change the output size, and lz4 level 9 (65 s) even seemed slightly faster than level 1 (69 s), but both produced 3.79 GB of compressed data.

I was also wondering how much overhead constructing the bytes object it wants adds (I have a memoryview as "data"):

def compress(self, data):
    # bytes(data) copies the memoryview contents into a new bytes object first
    return blosc.compress(bytes(data), 1, cname=self.CNAME, clevel=self._get_level())

I didn't find any other way, especially not how to hand it a pointer and a length.

FrancescAlted commented 9 years ago

@ThomasWaldmann would blosc.compress_ptr() not help?

http://python-blosc.blosc.org/tutorial.html#compressing-from-a-data-pointer

The example is for a NumPy array, but using ctypes you can perhaps make it work with strings as well.
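An untested sketch of what I have in mind for a plain writable buffer (typesize=1 just treats the data as raw bytes; the buffer contents are placeholders):

import ctypes
import blosc

buf = bytearray(b"some chunk of data to compress " * 1000)   # writable buffer

# Address of the buffer's first byte via ctypes (no extra copy of the data).
address = ctypes.addressof(ctypes.c_char.from_buffer(buf))

compressed = blosc.compress_ptr(address, len(buf), typesize=1,
                                cname='lz4', clevel=5)

# Decompress back into another writable buffer.
out = bytearray(len(buf))
blosc.decompress_ptr(compressed, ctypes.addressof(ctypes.c_char.from_buffer(out)))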

ThomasWaldmann commented 9 years ago

I had a look at compress_ptr and also at the memoryview docs, but there is no (pure) Python way to get at the address of the data in a memoryview. I don't use numpy, but thanks for the tip about using ctypes.

But I think there should be an easier way; most Python devs won't invoke ctypes just to get at a pointer. See also the ticket I opened - maybe memoryviews could be supported better.

FrancescAlted commented 9 years ago

Agreed, supporting memoryviews would be cool. If you feel like contributing a PR for this, that would be fantastic.

esc commented 9 years ago

Proposed fix for this in #81

esc commented 9 years ago

Closing because this has been open for too long.

esc commented 9 years ago

Feel free to reopen if you think the issue persists.

esc commented 9 years ago

Here is the patch if you want to resurrect it:

commit 25cd5871d5732d8c29c13d92a6381cf2ef4d515f
Author: Valentin Haenel <valentin@haenel.co>
Date:   Sat Mar 28 22:52:05 2015 +0100

    update docs for set_blocksize, fixes #76

diff --git a/blosc/toplevel.py b/blosc/toplevel.py
index 82932d0e36..8f51afd81c 100644
--- a/blosc/toplevel.py
+++ b/blosc/toplevel.py
@@ -98,13 +98,25 @@ def set_nthreads(nthreads):
 def set_blocksize(blocksize):
     """set_blocksize(blocksize)

-    Force the use of a specific blocksize.  If 0, an automatic
+    Force the use of a specific blocksize in bytes.  If 0, an automatic
     blocksize will be used (the default).

     Notes
     -----

-    This is a low-level function and is recommened for expert users only.
+    This is a low-level function and is recommended for expert users only.
+    Changing the blocksize can have a profound effect on the performance of
+    blosc. If the blocksize is too large, each block may not fit into the CPU
+    caches anymore, thereby rendering the blocking technique ineffective.
+    For example, a block may have to travel to and from memory twice, once when
+    applying the shuffle filter and a second time for doing the actual
+    compression. Also, for a large blocksize, blosc may not be able to split
+    the input, depending on its size, which in turn means no multithreading.
+    If the blocksize is too small, the amount of constant overhead is increased,
+    since each block must store a header that contains information about its
+    compressed size. Additionally, LZ77-style compressors may not reach the same
+    compression ratio as with larger blocks, since their internal dictionary
+    cannot be reused across block boundaries.

     Examples
     --------