Closes https://github.com/allen-cell-animated/timelapse-colorizer/issues/401 by writing data arrays as binary data, which speeds up dataset load times. After some research, we settled on the Apache Parquet format, since we already use it internally and it's widely recognized in the scientific community.
I've tested with the NucMorph datasets and found that, altogether, the new feature files are 10% of the size they used to be! The amount of compression varies pretty widely with the type of data, though.
Estimated review size: medium, 20-30 minutes
Changes
Moves feature min/max from the feature JSON files to the dataset manifest JSON. (This allows the JSON/Parquet files to hold only data.)
Adds an option to save files in the .parquet format in ColorizerDataWriter.write_feature() and ColorizerDataWriter.write_data().
Adds unit tests for writing JSON and Parquet data.
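For reviewers, here is a rough sketch of what the min/max move amounts to. The key names (`data`, `min`, `max`, `features`, `name`), the feature name `volume`, the file name `feature_0.parquet`, and the helper `split_feature_metadata` are all hypothetical and only for illustration; they are not taken from the actual manifest spec or from `ColorizerDataWriter`.

```python
import json


def split_feature_metadata(feature):
    """Separate a feature's data array from its min/max bounds.

    The key names used here are assumptions for illustration only.
    """
    data = feature["data"]
    # Recompute the bounds from the data so they can live in the manifest.
    bounds = {"min": min(data), "max": max(data)}
    # The per-feature file now carries nothing but the data array.
    return {"data": data}, bounds


# Old-style feature file: data and bounds stored together (hypothetical layout).
old_feature = {"data": [0.5, 2.0, 1.25], "min": 0.5, "max": 2.0}

data_only, bounds = split_feature_metadata(old_feature)

# New-style manifest entry: bounds sit next to the path of the data file.
manifest = {
    "features": [
        {"name": "volume", "data": "feature_0.parquet", **bounds},
    ]
}

print(json.dumps(manifest, indent=2))
```

With the bounds in the manifest, the per-feature file is a plain array, so it can be stored as JSON or Parquet without any format-specific metadata handling.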
Validation