Vindaar / TimepixAnalysis

Contains code related to calibration and analysis of Timepix based detector + CAST related code

Notes on compression of run data #6

Closed. Vindaar closed this issue 10 months ago.

Vindaar commented 5 years ago

As of commit https://github.com/Vindaar/TimepixAnalysis/commit/b4363c4c4b3f324d53845d404fd8e4c5d4b223eb, here are some performance notes on compression with different filters. Take the numbers (especially the run times) with a grain of salt, due to caching. Data run 146 from 2018 was used as the benchmark. All of this is done with a batch size (== chunk size for HDF5) of 10000 elements.
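For illustration, here is a minimal Python/h5py sketch of the setup being compared (independent of the project's own tooling; file name, dataset names and the random payload are made up for the example):

```python
import numpy as np
import h5py
import hdf5plugin  # provides the Blosc filter for h5py

# dummy payload standing in for the run data
data = np.random.randint(0, 2**16, size=1_000_000, dtype=np.uint16)

with h5py.File("compression_test.h5", "w") as f:
    # zlib (called "gzip" in h5py) at level 4, chunked in batches of 10000
    f.create_dataset("zlib4", data=data, chunks=(10000,),
                     compression="gzip", compression_opts=4)
    # Blosc LZ4, compression level 9, with byte shuffle
    f.create_dataset("blosc_lz4", data=data, chunks=(10000,),
                     **hdf5plugin.Blosc(cname="lz4", clevel=9,
                                        shuffle=hdf5plugin.Blosc.SHUFFLE))
    # uncompressed reference
    f.create_dataset("raw", data=data, chunks=(10000,))
```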

While blosc LZ4 with compression level 9 and shuffle is basically as fast as no compression and results in a file almost as small as zlib level 9, it comes with a major drawback: hdfview cannot read blosc compressed datasets. Therefore it's probably wise to choose zlib level 4 for now.
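The readability problem shows up for any reader without the Blosc plugin registered: HDF5 fails at read time, not at open time. A small sketch of such a check, reusing the hypothetical file and dataset names from above:

```python
import h5py

with h5py.File("compression_test.h5", "r") as f:
    try:
        f["blosc_lz4"][:10]  # forces a read, hence decompression
        print("Blosc dataset readable")
    except OSError as e:
        # this is the situation hdfview ends up in
        print(f"cannot read Blosc dataset: {e}")
```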

Vindaar commented 5 years ago

Continuing with the reconstruction. Using the .h5 file of run 146 after raw data manipulation is done, for the zlib level 4 case (84M in size). First of all, replace the datatypes of the cluster x, y and ch datasets from int to uint8 and uint16 respectively (see the sketch below):
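A sketch of that datatype change, again in Python/h5py with made-up names and dummy data (a Timepix chip has 256x256 pixels, so x and y fit in uint8; the ch values fit in uint16):

```python
import numpy as np
import h5py

x  = np.random.randint(0, 256,   size=100_000)  # pixel x: fits in uint8
y  = np.random.randint(0, 256,   size=100_000)  # pixel y: fits in uint8
ch = np.random.randint(0, 2**14, size=100_000)  # charge: fits in uint16

with h5py.File("reco_run146_sketch.h5", "w") as f:
    # cast before writing instead of storing everything as int64
    f.create_dataset("x",  data=x.astype(np.uint8),   chunks=(10000,),
                     compression="gzip", compression_opts=4)
    f.create_dataset("y",  data=y.astype(np.uint8),   chunks=(10000,),
                     compression="gzip", compression_opts=4)
    f.create_dataset("ch", data=ch.astype(np.uint16), chunks=(10000,),
                     compression="gzip", compression_opts=4)
```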

As a cross-check, perform zlib level 4 filtering on the int datasets:

Wow, contrary to my belief, that does not work. This is a very good argument for trying to optimize datatypes even more! EDIT: there was a bug in the software, which caused us to not actually convert the data to uint8/16 before writing it. Therefore we didn't actually store all x, y, ch values as uint8/16, but rather the zeros of the remaining int64 data. That screws up the above numbers. Going back to int is a no-go anyway.
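To see why writing zero buffers skews such a comparison (assuming the buggy write indeed produced mostly zeros, as described above), compare how zlib handles a realistic payload versus all zeros:

```python
import numpy as np
import zlib

vals  = np.random.randint(0, 256, size=100_000, dtype=np.uint8)
zeros = np.zeros(100_000, dtype=np.uint8)

# realistic payload: compresses modestly
print(len(zlib.compress(vals.tobytes(), 4)))
# all zeros: compresses to almost nothing, making the dataset
# look far smaller than real data ever would
print(len(zlib.compress(zeros.tobytes(), 4)))
```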

Vindaar commented 10 months ago

Our current defaults for the zlib filter strike a good balance between performance and storage size.