JuliaML / MLDatasets.jl

Utility package for accessing common Machine Learning datasets in Julia
https://juliaml.github.io/MLDatasets.jl/stable
MIT License

write datasets in a JLD2 or Arrow format for faster read #125

Open · CarloLucibello opened this issue 2 years ago

CarloLucibello commented 2 years ago

We could have a "processed" folder inside each dataset's folder where we write the dataset object the first time it is created. On subsequent constructions, e.g. d = MNIST(), we would just load the JLD2 file.

Example:

function MNIST(...)
    dataset_dir = ...
    processed_file = joinpath(dataset_dir, "processed", "dataset.jld2")
    if isfile(processed_file)
        # a cached processed copy exists: load and return it
        return FileIO.load(processed_file, "dataset")
    end

    # first construction: build the dataset from the raw files
    mnist = ...
    # cache it so later constructions are fast
    mkpath(dirname(processed_file))
    FileIO.save(processed_file, Dict("dataset" => mnist))
    return mnist
end

lorenzoh commented 2 years ago

I have done this for large vision datasets like COCO, whose JSON annotations can be slow to parse. One thing to keep in mind is the size of the JLD2 files, though of course that shouldn't be a problem for MNIST. Arrow.jl, which has built-in compression, can also be a good format when the samples are made up of primitive types and arrays.
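
A rough sketch of the Arrow route (the file name, column names, and toy data below are made up for illustration, not taken from MLDatasets):

using Arrow

# toy sample data: 100 flattened Float32 images and integer labels (illustrative only)
features = [rand(Float32, 28 * 28) for _ in 1:100]
targets = rand(0:9, 100)

# write the columns as an Arrow table with built-in zstd compression
Arrow.write("mnist_processed.arrow", (; features, targets); compress=:zstd)

# read it back; columns are memory-mapped rather than copied eagerly
table = Arrow.Table("mnist_processed.arrow")
table.features[1], table.targets[1]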

CarloLucibello commented 2 years ago

What should we expect for the JLD2 sizes? Hopefully not larger than the original data, right?

lorenzoh commented 2 years ago

It depends. If you have a large dataset of .jpg images and store them as decoded arrays (hence losslessly), the size can be several times that of the original files.
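
A back-of-the-envelope illustration of that blow-up (the resolution and the JPEG figure are assumed, typical values, not measurements):

# a 224×224 RGB image stored as a UInt8 array
raw_bytes = 224 * 224 * 3        # 150_528 bytes ≈ 147 KiB
jpeg_bytes = 30_000              # a typical JPEG at that resolution (assumed)
raw_bytes / jpeg_bytes           # ≈ 5× larger when stored losslessly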

zsz00 commented 2 years ago

I also agree that Arrow.jl is a good format:

  1. built-in compression
  2. cross-language dataset processing

CarloLucibello commented 2 years ago

HuggingFace's datasets library also uses Arrow: https://huggingface.co/docs/datasets/about_arrow

CarloLucibello commented 1 year ago

Some code showing how to read/write color arrays from/to Arrow tables: https://gist.github.com/CarloLucibello/51d713ec4a1612b46e6c90e53c0f88e8
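
The gist has the details; as a minimal sketch of the general idea (not the gist's exact code, and assuming ImageCore for the channel/color view helpers), one could flatten a color image to raw bytes, store it in an Arrow column, and rebuild it on read:

using Arrow, ImageCore

# an RGB image with 8-bit channels (toy data)
img = rand(RGB{N0f8}, 28, 28)

# flatten to raw UInt8 values, keeping the shape needed to rebuild it
raw = collect(vec(rawview(channelview(img))))
Arrow.write("image.arrow", (; data = [raw], height = [28], width = [28]))

# read back and reconstruct the color array
tbl = Arrow.Table("image.arrow")
bytes = collect(tbl.data[1])
img2 = colorview(RGB, normedview(reshape(bytes, 3, tbl.height[1], tbl.width[1])))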