Blosc / bloscpack

Command line interface to and serialization format for Blosc
BSD 3-Clause "New" or "Revised" License

numpy savez_compressed much smaller filesizes for small arrays #116

Open lopsided opened 2 years ago

lopsided commented 2 years ago

I have a few million images to save to disk and have been trying out a few options. I thought blosc/bloscpack would be well suited, but I'm getting far larger files than with the standard numpy savez_compressed.

My images have shape (3, 200, 200) and dtype=float32. Typical file sizes I'm getting are:

For a sample of 370 images this gives:

67M      ./blosc_packarray
67M      ./blosc_pointer
121M     ./bp
19M      ./npz
172M     ./uncompressed
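To make the gap concrete, the listing above can be turned into compression ratios. This is only a back-of-the-envelope sketch: the sizes are the rounded totals from the listing, and the helper name `ratios` is made up for illustration.

```python
# Sizes in MB, copied from the rounded directory totals in the listing above.
sizes_mb = {
    "blosc_packarray": 67,
    "blosc_pointer": 67,
    "bp": 121,
    "npz": 19,
    "uncompressed": 172,
}

# Raw payload of one image: 3 x 200 x 200 float32 values, 4 bytes each.
bytes_per_image = 3 * 200 * 200 * 4


def ratios(sizes, baseline="uncompressed"):
    """Return baseline_size / size for every entry (hypothetical helper)."""
    base = sizes[baseline]
    return {name: round(base / size, 2) for name, size in sizes.items()}


print("bytes per image:", bytes_per_image)
for name, ratio in sorted(ratios(sizes_mb).items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {ratio:5.2f}x")
```

On these numbers npz is roughly 9x smaller than the uncompressed data, the two blosc_* variants about 2.6x, and bp about 1.4x.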

For the blosc_* methods I'm writing the packed bytes like:

with open(dest, 'wb') as f:
    f.write(packed)

Is there anything I'm missing or is numpy's compression just as good as it gets for small images like these?

esc commented 2 years ago

@lopsided thank you for asking about this. What settings are you using for Blosc and bloscpack? Maybe you need to use a higher compression level (like 9) and/or change the internal algorithm. I think it could be worth a shot.

esc commented 2 years ago

@lopsided a list of settings to explore is here: https://github.com/Blosc/bloscpack#settings

If you can share the data or an anonymized variant that has similar entropy we could look into this in more detail.
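One way to explore those settings is to sweep the codec, shuffle mode, and compression level over the same buffer and compare compressed sizes. The following is a minimal sketch, assuming the python-blosc package is installed. `sweep_blosc` and `deflate_baseline` are hypothetical helper names, and the zlib baseline is only a rough stand-in for np.savez_compressed, which stores arrays in a zip archive using DEFLATE.

```python
import itertools
import zlib

try:
    import blosc  # python-blosc; treated as optional in this sketch
except ImportError:
    blosc = None


def sweep_blosc(raw, typesize, clevels=(5, 9)):
    """Return {(cname, shuffle_name, clevel): compressed_size} for a bytes payload.

    Returns an empty dict when python-blosc is not available.
    """
    if blosc is None:
        return {}
    shuffles = {
        "noshuffle": blosc.NOSHUFFLE,
        "shuffle": blosc.SHUFFLE,
        "bitshuffle": blosc.BITSHUFFLE,
    }
    sizes = {}
    for cname, (sname, sflag), clevel in itertools.product(
        blosc.compressor_list(), shuffles.items(), clevels
    ):
        packed = blosc.compress(
            raw, typesize=typesize, clevel=clevel, shuffle=sflag, cname=cname
        )
        sizes[(cname, sname, clevel)] = len(packed)
    return sizes


def deflate_baseline(raw, level=6):
    """Size of the payload after plain zlib (DEFLATE) compression."""
    return len(zlib.compress(raw, level))
```

For a NumPy array this could be called as `sweep_blosc(images.tobytes(), images.dtype.itemsize)` and the smallest entries compared against `deflate_baseline(images.tobytes())`. Which combination wins depends on the data, so no particular outcome is implied here.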

lopsided commented 2 years ago

Thanks for the quick reply!

I've just been using pretty much default settings:

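import blosc
import bloscpack as bp

# Variant 1 (presumably the blosc_pointer directory): compress from the array's buffer
packed = blosc.compress_ptr(
    address=images.__array_interface__['data'][0],
    items=images.size,
    typesize=images.dtype.itemsize,
    clevel=9,
    shuffle=blosc.SHUFFLE
)

# Variant 2 (presumably blosc_packarray): blosc's array helper with its defaults
packed = blosc.pack_array(images)

# Variant 3 (presumably bp): bloscpack writes the file itself
bp.pack_ndarray_to_file(images, dest)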

I've attached an example image (actually a triplet of greyscale images), saved uncompressed using np.savez. (I had to rename it to .zip to make github happy).

000000.zip

esc commented 2 years ago

> Thanks for the quick reply!

Thank you, it may take me a few days to tinker.

esc commented 2 years ago

I am so sorry, but there was no space left in my schedule to look into this.