allen-cell-animated / colorizer-data

example scripts and utilities for preparing data for the time series colorizer app

feature: Write features as `.parquet` binaries #53

Closed ShrimpCryptid closed 4 months ago

ShrimpCryptid commented 4 months ago

Closes https://github.com/allen-cell-animated/timelapse-colorizer/issues/401 by writing feature data arrays as binary files, which speeds up dataset load times. After some research, we settled on the Apache Parquet format, since we already use it internally and it's widely recognized in the scientific community.

I've tested with the NucMorph datasets and found that, altogether, the new feature files are about 10% of their previous size! The amount of compression varies pretty widely with the type of data, though.
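To give a rough feel for why the savings depend on the data, here is a standard-library sketch that compares a text encoding against compressed binary for two made-up arrays. It is only a stand-in: it uses JSON as an assumed text baseline and `zlib` over packed `float32` values instead of Parquet, and the `noisy` and `labels` arrays are synthetic, not NucMorph data.

```python
import json
import random
import struct
import zlib


def text_size(values):
    """Size in bytes of the values serialized as JSON text."""
    return len(json.dumps(values).encode("utf-8"))


def compressed_binary_size(values):
    """Size in bytes of the values packed as little-endian float32, then zlib-compressed."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return len(zlib.compress(packed))


random.seed(0)
# Synthetic examples: a noisy continuous feature and a repetitive label-like feature.
noisy = [random.random() for _ in range(1000)]
labels = [float(i % 5) for i in range(1000)]

for name, values in (("noisy", noisy), ("labels", labels)):
    ratio = compressed_binary_size(values) / text_size(values)
    print(f"{name}: binary is {ratio:.0%} of the text size")
```

Repetitive, low-cardinality data compresses far better than noisy continuous values, so per-feature ratios can sit well above or below the overall figure.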

Estimated review size: medium, 20-30 minutes

Changes

Validation