Vindaar / TimepixAnalysis

Contains code related to calibration and analysis of Timepix based detector + CAST related code

Notes on compression of run data #6

Closed. Vindaar closed this issue 10 months ago.

Vindaar commented 5 years ago

As of commit https://github.com/Vindaar/TimepixAnalysis/commit/b4363c4c4b3f324d53845d404fd8e4c5d4b223eb, here are some performance notes on compression with different filters. Take the numbers (especially the run times) with a grain of salt, due to caching. Data run 146 from 2018 was used as the benchmark. All of this is done with a batch size (== chunk size for HDF5) of 10000 elements.
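For illustration, here is a minimal Python/h5py sketch of the setup being compared (independent of the project's own tooling; file name, dataset names and the random payload are made up for the example):

```python
import numpy as np
import h5py
import hdf5plugin  # provides the Blosc filter for h5py

# dummy payload standing in for the run data
data = np.random.randint(0, 2**16, size=1_000_000, dtype=np.uint16)

with h5py.File("compression_test.h5", "w") as f:
    # zlib (called "gzip" in h5py) at level 4, chunked in batches of 10000
    f.create_dataset("zlib4", data=data, chunks=(10000,),
                     compression="gzip", compression_opts=4)
    # Blosc LZ4, compression level 9, with byte shuffle
    f.create_dataset("blosc_lz4", data=data, chunks=(10000,),
                     **hdf5plugin.Blosc(cname="lz4", clevel=9,
                                        shuffle=hdf5plugin.Blosc.SHUFFLE))
    # uncompressed reference
    f.create_dataset("raw", data=data, chunks=(10000,))
```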

While blosc LZ4 with compression level 9 and shuffle is basically as fast as no compression and results in a file almost as small as zlib level 9, it comes with a major drawback: hdfview cannot read blosc compressed datasets. Therefore it's probably wise to choose zlib level 4 for now.
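The readability problem shows up for any reader without the Blosc plugin registered: HDF5 fails at read time, not at open time. A small sketch of such a check, reusing the hypothetical file and dataset names from above:

```python
import h5py

with h5py.File("compression_test.h5", "r") as f:
    try:
        f["blosc_lz4"][:10]  # forces a read, hence decompression
        print("Blosc dataset readable")
    except OSError as e:
        # this is the situation hdfview ends up in
        print(f"cannot read Blosc dataset: {e}")
```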

Vindaar commented 5 years ago

Continuing with the reconstruction. Using the .h5 file of run 146 after raw data manipulation is done, for the zlib level 4 case (84M in size). First of all, replace the datatypes of the cluster x, y and ch datasets from int to uint8 and uint16 respectively (see the sketch below):
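A sketch of that datatype change, again in Python/h5py with made-up names and dummy data (a Timepix chip has 256x256 pixels, so x and y fit in uint8; the ch values fit in uint16):

```python
import numpy as np
import h5py

x  = np.random.randint(0, 256,   size=100_000)  # pixel x: fits in uint8
y  = np.random.randint(0, 256,   size=100_000)  # pixel y: fits in uint8
ch = np.random.randint(0, 2**14, size=100_000)  # charge: fits in uint16

with h5py.File("reco_run146_sketch.h5", "w") as f:
    # cast before writing instead of storing everything as int64
    f.create_dataset("x",  data=x.astype(np.uint8),   chunks=(10000,),
                     compression="gzip", compression_opts=4)
    f.create_dataset("y",  data=y.astype(np.uint8),   chunks=(10000,),
                     compression="gzip", compression_opts=4)
    f.create_dataset("ch", data=ch.astype(np.uint16), chunks=(10000,),
                     compression="gzip", compression_opts=4)
```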

As a cross-check, perform zlib level 4 filtering on the int datasets:

Wow, contrary to my belief, that does not work. This is a very good argument for trying to optimize datatypes even more! EDIT: there was a bug in the software, which caused us to not actually convert the data to uint8/16 before writing it. Therefore we didn't actually store all x, y, ch values as uint8/16, but rather the zeros of the remaining int64 data. That screws up the above numbers. Going back to int is a no-go anyway.
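To see why writing zero buffers skews such a comparison (assuming the buggy write indeed produced mostly zeros, as described above), compare how zlib handles a realistic payload versus all zeros:

```python
import numpy as np
import zlib

vals  = np.random.randint(0, 256, size=100_000, dtype=np.uint8)
zeros = np.zeros(100_000, dtype=np.uint8)

# realistic payload: compresses modestly
print(len(zlib.compress(vals.tobytes(), 4)))
# all zeros: compresses to almost nothing, making the dataset
# look far smaller than real data ever would
print(len(zlib.compress(zeros.tobytes(), 4)))
```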

Vindaar commented 10 months ago

Our current defaults for the zlib filter strike a good balance between performance and storage size.