GenevieveBuckley opened this issue 5 years ago
Comparing the different compression algorithms and compression levels, zstd at compression level 9 (the Blosc maximum) was the most effective, reducing a 381.5 MB array to 507 KB.
import zarr
import numpy as np
from numcodecs import Blosc

# cname and clevel were varied between the runs below;
# this shows one configuration.
compressor = Blosc(cname='zlib', clevel=9, shuffle=Blosc.BITSHUFFLE)

# 400 MB synthetic test array of 32-bit integers
data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
print(z.compressor)
print(z.info)
Code adapted from the zarr tutorial: https://zarr.readthedocs.io/en/stable/tutorial.html
Summary of Results: detailed compression results.pdf
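For reproducibility, the separate runs below can be driven by a single sweep. This is a rough sketch rather than the exact compressor.py script used; it assumes the synthetic array above and times each codec at compression level 3 with Blosc bitshuffle:

```python
import time

import numpy as np
import zarr
from numcodecs import Blosc

data = np.arange(100000000, dtype='i4').reshape(10000, 10000)

# 'snappy' support has been dropped from some newer Blosc builds,
# so it may need to be removed from this list.
for cname in ['zstd', 'blosclz', 'lz4', 'lz4hc', 'zlib', 'snappy']:
    compressor = Blosc(cname=cname, clevel=3, shuffle=Blosc.BITSHUFFLE)
    start = time.perf_counter()
    z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
    elapsed = time.perf_counter() - start
    ratio = z.nbytes / z.nbytes_stored
    print(f'{cname}: {z.nbytes_stored} bytes stored, '
          f'ratio {ratio:.1f}, {elapsed:.1f} s')
```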
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 3379344 (3.2M)
Storage ratio : 118.4
Chunks initialized : 100/100
Runtime: 2 seconds
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 519332 (507.2K)
Storage ratio : 770.2
Chunks initialized : 100/100
Runtime: 45 seconds
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='blosclz', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 13543704 (12.9M)
Storage ratio : 29.5
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 13788015 (13.1M)
Storage ratio : 29.0
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='lz4hc', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 5137515 (4.9M)
Storage ratio : 77.9
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='zlib', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 5129740 (4.9M)
Storage ratio : 78.0
Chunks initialized : 100/100
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='snappy', clevel=3, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 25851986 (24.7M)
Storage ratio : 15.5
Chunks initialized : 100/100
Runtime: 2 seconds
(my-venv) C:\Users\prime.000\my-venv>compressor.py
Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE, blocksize=0)
Type : zarr.core.Array
Data type : int32
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
Order : C
Read-only : False
Compressor : Blosc(cname='snappy', clevel=9, shuffle=BITSHUFFLE,
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 25851986 (24.7M)
Storage ratio : 15.5
Chunks initialized : 100/100
Runtime: 2 seconds
Thank you @timbo8! This is very helpful for us :)
@timothywallaby has done a bunch of work at the PyConAU sprints on saving OpenSlide images as compressed zarr files. This gist shows how to do that if the whole array fits into memory. Thanks @timothywallaby!
We're now working on appending to zarr arrays, for cases where the entire image does not fit into memory (one possible tile-wise approach is sketched after this comment).
The code from @sofroniewn is here: https://github.com/sofroniewn/image-demos/blob/master/helpers/make_2D_zarr_pathology.py
The instructions were not to use it as-is until we work out why the saved file is bigger than the original TIFF. Personally, I also feel that for this purpose we don't really need the multilevel hierarchy, so dropping it might make things a bit simpler.
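For the does-not-fit-in-memory case, one option is to preallocate the zarr array at the slide's full resolution and copy OpenSlide tiles into it region by region, which sidesteps appending entirely. A minimal sketch, assuming a single-level output and a hypothetical input file slide.tif:

```python
import numpy as np
import openslide
import zarr
from numcodecs import Blosc

slide = openslide.OpenSlide('slide.tif')  # hypothetical input file
width, height = slide.dimensions          # level-0 dimensions

compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
z = zarr.open('slide.zarr', mode='w', shape=(height, width, 4),
              chunks=(1024, 1024, 4), dtype='u1', compressor=compressor)

tile = 1024
for y in range(0, height, tile):
    for x in range(0, width, tile):
        w = min(tile, width - x)
        h = min(tile, height - y)
        # read_region returns an RGBA PIL image; coordinates are level-0
        region = slide.read_region((x, y), 0, (w, h))
        z[y:y + h, x:x + w] = np.asarray(region)
```

Aligning the tile size with the chunk grid (here both are 1024 pixels) means each chunk is compressed exactly once rather than rewritten by overlapping writes.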
Yeah, I think @thewtex has had similar success with zstd. He may also have some good thoughts on histology datasets we could look at.
@dzenanz surveyed a wide variety of codecs and compression levels over a diverse set of image datasets. In general, he also found that zstd and lz4 with Blosc bitshuffle enabled performed best. However, a compression level of 9 did not justify the increased compute time over a compression level of 3. @dzenanz, could you please share your results?
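That trade-off is easy to check directly. A sketch (not @dzenanz's survey code) that times zstd at levels 3 and 9 on the synthetic array from earlier:

```python
import time

import numpy as np
import zarr
from numcodecs import Blosc

data = np.arange(100000000, dtype='i4').reshape(10000, 10000)

# Compare compression ratio against wall-clock time for the two levels.
for clevel in [3, 9]:
    compressor = Blosc(cname='zstd', clevel=clevel, shuffle=Blosc.BITSHUFFLE)
    start = time.perf_counter()
    z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
    print(f'clevel={clevel}: ratio {z.nbytes / z.nbytes_stored:.1f}, '
          f'{time.perf_counter() - start:.1f} s')
```

On the runs above, level 9 improved the ratio from 118.4 to 770.2 but took roughly 20x longer (2 s vs 45 s), which is the trade-off being discussed.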
More histopathology images can be found here:
https://digitalpathologyassociation.org/whole-slide-imaging-repository
We want to use a histology slide image from the 2016 Camelyon dataset (CC0): https://camelyon17.grand-challenge.org/Data/
This thread contains details specific to this dataset.
Related to the larger discussion here: https://github.com/dask/dask-image/issues/107