GenevieveBuckley opened this issue 5 years ago
Comparing the different compression algorithms and compression levels, zstd at compression level 9 (the Blosc maximum) was the most effective, reducing a 381.5 MB array to 507 KB.
import zarr
import numpy as np
from numcodecs import Blosc

# cname and clevel were varied between the runs below;
# this shows one configuration.
compressor = Blosc(cname='zlib', clevel=9, shuffle=Blosc.BITSHUFFLE)

# 400 MB synthetic test array of 32-bit integers
data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
print(z.compressor)
print(z.info)
Code adapted from the zarr tutorial: https://zarr.readthedocs.io/en/stable/tutorial.html
Summary of Results: detailed compression results.pdf
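For reproducibility, the separate runs below can be driven by a single sweep. This is a rough sketch rather than the exact compressor.py script used; it assumes the synthetic array above and times each codec at compression level 3 with Blosc bitshuffle:

```python
import time

import numpy as np
import zarr
from numcodecs import Blosc

data = np.arange(100000000, dtype='i4').reshape(10000, 10000)

# 'snappy' support has been dropped from some newer Blosc builds,
# so it may need to be removed from this list.
for cname in ['zstd', 'blosclz', 'lz4', 'lz4hc', 'zlib', 'snappy']:
    compressor = Blosc(cname=cname, clevel=3, shuffle=Blosc.BITSHUFFLE)
    start = time.perf_counter()
    z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
    elapsed = time.perf_counter() - start
    ratio = z.nbytes / z.nbytes_stored
    print(f'{cname}: {z.nbytes_stored} bytes stored, '
          f'ratio {ratio:.1f}, {elapsed:.1f} s')
```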
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 3379344 (3.2M)
Storage ratio : 118.4
Chunks initialized : 100/100
Runtime: 2 seconds
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 519332 (507.2K)
Storage ratio : 770.2
Chunks initialized : 100/100
Runtime: 45 seconds
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 13543704 (12.9M)
Storage ratio : 29.5
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 13788015 (13.1M)
Storage ratio : 29.0
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 5137515 (4.9M)
Storage ratio : 77.9
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 5129740 (4.9M)
Storage ratio : 78.0
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 25851986 (24.7M)
Storage ratio : 15.5
Chunks initialized : 100/100
Runtime: 2 seconds
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 25851986 (24.7M)
Storage ratio : 15.5
Chunks initialized : 100/100
Runtime: 2 seconds
Thank you @timbo8! This is very helpful for us :)
@timothywallaby has done a bunch of work at the PyConAU sprints on saving OpenSlide images as compressed zarr files. This gist shows how to do that if the whole array fits into memory. Thanks @timothywallaby!
We're now working on appending to zarr arrays, for cases where the entire image does not fit into memory (one possible tile-wise approach is sketched after this comment).
The code from @sofroniewn is here: https://github.com/sofroniewn/image-demos/blob/master/helpers/make_2D_zarr_pathology.py
The instructions were not to use it as-is until we work out why the saved file is bigger than the original TIFF. Personally, I also feel that for this purpose we don't really need the multilevel hierarchy, so dropping it might make things a bit simpler.
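For the does-not-fit-in-memory case, one option is to preallocate the zarr array at the slide's full resolution and copy OpenSlide tiles into it region by region, which sidesteps appending entirely. A minimal sketch, assuming a single-level output and a hypothetical input file slide.tif:

```python
import numpy as np
import openslide
import zarr
from numcodecs import Blosc

slide = openslide.OpenSlide('slide.tif')  # hypothetical input file
width, height = slide.dimensions          # level-0 dimensions

compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
z = zarr.open('slide.zarr', mode='w', shape=(height, width, 4),
              chunks=(1024, 1024, 4), dtype='u1', compressor=compressor)

tile = 1024
for y in range(0, height, tile):
    for x in range(0, width, tile):
        w = min(tile, width - x)
        h = min(tile, height - y)
        # read_region returns an RGBA PIL image; coordinates are level-0
        region = slide.read_region((x, y), 0, (w, h))
        z[y:y + h, x:x + w] = np.asarray(region)
```

Aligning the tile size with the chunk grid (here both are 1024 pixels) means each chunk is compressed exactly once rather than rewritten by overlapping writes.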
Yeah, I think @thewtex has had similar success with zstd. He may also have some good thoughts on histology datasets we could look at.
@dzenanz surveyed a wide variety of codecs and compression levels over a diverse set of image datasets. In general, he also found that zstd and lz4 with Blosc bitshuffle enabled performed best. However, a compression level of 9 did not justify the increased compute time over a compression level of 3. @dzenanz, could you please share your results?
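That trade-off is easy to check directly. A sketch (not @dzenanz's survey code) that times zstd at levels 3 and 9 on the synthetic array from earlier:

```python
import time

import numpy as np
import zarr
from numcodecs import Blosc

data = np.arange(100000000, dtype='i4').reshape(10000, 10000)

# Compare compression ratio against wall-clock time for the two levels.
for clevel in [3, 9]:
    compressor = Blosc(cname='zstd', clevel=clevel, shuffle=Blosc.BITSHUFFLE)
    start = time.perf_counter()
    z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
    print(f'clevel={clevel}: ratio {z.nbytes / z.nbytes_stored:.1f}, '
          f'{time.perf_counter() - start:.1f} s')
```

On the runs above, level 9 improved the ratio from 118.4 to 770.2 but took roughly 20x longer (2 s vs 45 s), which is the trade-off being discussed.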
More histopathology images can be found here:
https://digitalpathologyassociation.org/whole-slide-imaging-repository
We want to use a histology slide image from the 2016 Camelyon dataset (CC0): https://camelyon17.grand-challenge.org/Data/
This thread contains details specific to this dataset.
Related to the larger discussion here: https://github.com/dask/dask-image/issues/107