CarloLucibello opened this issue 2 years ago
I have done this for large vision datasets like COCO, whose annotations are stored in JSON and can be slow to parse. One thing to keep in mind is the size of the JLD2 files, though of course that shouldn't be a problem for MNIST. Arrow.jl can also be a good format, with built-in compression, when the samples are made up of primitive types and arrays.
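A minimal sketch of the JLD2 caching idea for slow-to-parse JSON annotations; the file names and the structure of the annotations are assumptions, not the actual COCO layout:

```julia
# Sketch: parse the JSON once, then cache the materialized result in a JLD2
# file so subsequent loads skip the slow parse. Paths are hypothetical.
using JLD2, JSON3

function load_annotations(dir::AbstractString)
    cache = joinpath(dir, "annotations.jld2")
    if isfile(cache)
        return load(cache, "annotations")   # fast path: read the JLD2 cache
    end
    ann = JSON3.read(read(joinpath(dir, "annotations.json")))  # slow JSON parse
    ann = copy(ann)                         # materialize into plain Dicts/Arrays
    jldsave(cache; annotations = ann)       # write the cache for next time
    return ann
end
```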
What should we expect for the JLD2 file sizes? Hopefully not larger than the original data, right?
It depends. If you have a large dataset of .jpg images and store them as decoded arrays (hence losslessly), the size can be several multiples of the original, since JPEG is lossily compressed.
I also agree that Arrow.jl is a good format:
HuggingFace's datasets library also uses Arrow: https://huggingface.co/docs/datasets/about_arrow
Some code showing how to read/write color arrays from/to Arrow tables: https://gist.github.com/CarloLucibello/51d713ec4a1612b46e6c90e53c0f88e8
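For reference, a minimal sketch (assumptions, not the gist's exact code) of storing image arrays in an Arrow table: each image is flattened to a vector so it fits an Arrow list column, and the shape is restored on read.

```julia
using Arrow

images = [rand(Float32, 28, 28) for _ in 1:100]   # stand-in for MNIST-like data
labels = rand(0:9, 100)

# A NamedTuple of column vectors is a valid Tables.jl table.
tbl = (; image = vec.(images), label = labels)
Arrow.write("mnist.arrow", tbl; compress = :zstd)  # built-in compression

# Read back and restore the original shape.
t = Arrow.Table("mnist.arrow")
restored = [reshape(collect(img), 28, 28) for img in t.image]
```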
We could have a "processed" folder inside each dataset folder, where we write the dataset object the first time it is created. On subsequent constructions, e.g.
d = MNIST()
we just load the JLD2 file. Example:
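A minimal sketch of that caching pattern, assuming JLD2 can serialize the dataset object; the helper name, paths, and the `dir` keyword are illustrative, not MLDatasets' actual implementation:

```julia
using JLD2, MLDatasets

function MNIST_cached(dir::AbstractString)
    processed = joinpath(dir, "processed")
    cache = joinpath(processed, "mnist.jld2")
    if isfile(cache)
        return load(cache, "dataset")   # subsequent calls: just load the JLD2 file
    end
    d = MNIST(dir = dir)                # first call: build the dataset as usual
    mkpath(processed)
    jldsave(cache; dataset = d)         # write it to the "processed" folder
    return d
end
```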