allen-cell-animated / timelapse-colorizer

A web-based, time-series visualizer for tracked segmented data.

Use binary format for feature data #401

Closed: ShrimpCryptid closed this 11 hours ago

ShrimpCryptid commented 2 weeks ago

Use Case

Loading features is currently VERY slow in our example datasets, and we can definitely do better. Features are currently stored as JSON files containing an array of numbers; switching to a binary format would significantly decrease the size of feature files and allow datasets to load faster.

The tradeoff is that they will be harder to read, but if we maintain backwards compatibility with the JSON format it should not be an issue.
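
As a rough illustration of the size difference, here is a back-of-the-envelope sketch (not project code; the array length and field names are made up) comparing the same float32 column serialized as JSON text vs. raw bytes:

```python
# Back-of-the-envelope sketch (not project code): compare the size of a
# feature column stored as JSON text vs. raw float32 bytes. The column
# length and field name are arbitrary examples.
import json
import numpy as np

values = np.random.default_rng(0).random(100_000).astype(np.float32)

json_bytes = json.dumps({"data": values.tolist()}).encode("utf-8")
raw_bytes = values.tobytes()  # 4 bytes per value, no text parsing on load

print(f"JSON:   {len(json_bytes):,} bytes")
print(f"binary: {len(raw_bytes):,} bytes")
```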

Acceptance Criteria


ShrimpCryptid commented 2 weeks ago

TODO: Research different binary formats (npy, tiff, hdf5, zarr, etc.). Ideally, find something that is single-file; it must be readable/writable from both JS and Python.

It would also be nice for this change to include a conversion utility in colorizer-data that converts existing feature data to binary.
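
A converter could stay pretty small. A hypothetical sketch (not the actual colorizer-data API; the {"data": [...], "min": ..., "max": ...} file layout is an assumption) that rewrites a JSON feature file as .npy:

```python
# Hypothetical converter sketch: read an existing JSON feature file and write
# the values back out as .npy. The {"data": [...], "min": ..., "max": ...}
# layout is assumed; min/max could instead move into the manifest.
import json
from pathlib import Path

import numpy as np

def convert_feature_to_npy(json_path: Path) -> Path:
    with open(json_path) as f:
        feature = json.load(f)
    values = np.asarray(feature["data"], dtype=np.float32)
    out_path = json_path.with_suffix(".npy")
    np.save(out_path, values)
    return out_path
```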

toloudis commented 2 weeks ago

I would consider starting with npy, but also look at Apache Arrow or Parquet. Arrow is a fairly modern and well-supported array container:
https://arrow.apache.org/docs/js/
https://arrow.apache.org/docs/python/

You'll have to read a little on the difference between Arrow and Parquet. I think Parquet is basically a higher-level thing built on top of Arrow: https://stackoverflow.com/questions/56472727/difference-between-apache-parquet-and-arrow
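
For reference, writing one feature column with pyarrow takes only a few lines in either format. This is just a sketch with placeholder column and file names; the Arrow IPC output should then be readable on the JS side with the apache-arrow package:

```python
# Sketch: write one feature column as an Arrow IPC (Feather v2) file and as
# Parquet using pyarrow. Column and file names are placeholders.
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

values = np.random.default_rng(0).random(100_000).astype(np.float32)
table = pa.table({"feature": values})

feather.write_feather(table, "feature.arrow")  # Arrow IPC / Feather v2
pq.write_table(table, "feature.parquet")       # Parquet
```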

ShrimpCryptid commented 1 week ago

Okay, typing up some (basic) research notes:

Parquet is intended for archival or storage on disk. It's VERY space-efficient at the cost of being more expensive to read, since it must be decoded. It's column-oriented, which is great for us, since we're basically only storing arrays. HOWEVER, Parquet doesn't have a lot of JS support, and the libraries I've seen don't seem to be in widespread use. Also, Parquet is designed to be directory-based (a directory containing one or more files, like Zarr), which would make our datasets a little messier to navigate.

Arrow is intended for storage/data manipulation in memory, and it and Parquet are designed to work together. I think Arrow doesn't provide the same data compression advantages, since it's designed to be 1:1 with memory.

I also peeked at the HDF5 and .npy file formats, and both have a similar issue: there aren't many widely used libraries for them.

I'm looking into Apache Avro now, which is similar to protobuf but designed to be more extensible(?). Both define binary schemas, which might be a bit overkill for what we're doing, but Avro has pretty widespread support: https://avro.apache.org/

The nice thing is that Avro would be a drop-in replacement for our current JSON feature files, which store a tiny bit of extra metadata (min/max), though I could still just move that into the feature metadata in the manifest.
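
To make the min/max point concrete, here is a sketch of what an Avro feature record could look like, using fastavro as one possible Python library (the schema and field names are illustrative, not a proposed spec):

```python
# Sketch: an Avro record that keeps min/max alongside the data array, written
# with fastavro. Schema and field names are illustrative only.
import numpy as np
from fastavro import parse_schema, writer

schema = parse_schema({
    "type": "record",
    "name": "Feature",
    "fields": [
        {"name": "data", "type": {"type": "array", "items": "float"}},
        {"name": "min", "type": "float"},
        {"name": "max", "type": "float"},
    ],
})

values = np.random.default_rng(0).random(100_000).astype(np.float32)
record = {
    "data": values.tolist(),
    "min": float(values.min()),
    "max": float(values.max()),
}

with open("feature.avro", "wb") as out:
    writer(out, schema, [record])
```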

toloudis commented 1 week ago

My notes: bear in mind that we are working with tiny files relative to what these systems are designed for, so simplicity on the JavaScript side will be a plus.

Avro seems to be row-oriented which is the opposite of what we need for this data. But the point about storing min/max (extra data apart from the main array) is very relevant.

Compression is probably not important for data of this size; we will already be getting effective "compression" just by moving out of JSON. The tradeoff of file download size vs. decoding time is probably minimal, but that's just a guess.

Is it easy to write some Python code to output 3 or 4 different formats (npy, avro, and arrow, for example) and then compare?
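
Something like the following throwaway script could do that comparison (a sketch; the library choices and file paths are arbitrary):

```python
# Sketch of a throwaway comparison script: write one feature column as JSON,
# .npy, Arrow IPC, Parquet, and Avro, then print the file sizes. Library
# choices (pyarrow, fastavro) and paths are arbitrary.
import json
import os

import numpy as np
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
from fastavro import parse_schema, writer

values = np.random.default_rng(0).random(100_000).astype(np.float32)

with open("feature.json", "w") as f:           # current format, roughly
    json.dump({"data": values.tolist()}, f)

np.save("feature.npy", values)                 # npy

table = pa.table({"feature": values})
feather.write_feather(table, "feature.arrow")  # Arrow IPC
pq.write_table(table, "feature.parquet")       # Parquet

schema = parse_schema({
    "type": "record",
    "name": "Feature",
    "fields": [{"name": "data", "type": {"type": "array", "items": "float"}}],
})
with open("feature.avro", "wb") as f:          # Avro
    writer(f, schema, [{"data": values.tolist()}])

for path in ("feature.json", "feature.npy", "feature.arrow",
             "feature.parquet", "feature.avro"):
    print(f"{path:>16}: {os.path.getsize(path):>10,} bytes")
```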

(Tangentially, I'd also be curious to see if it's effective to store all the features in one single file - this would have implications for CFE also)