JuliaML / MLDatasets.jl

Utility package for accessing common Machine Learning datasets in Julia
https://juliaml.github.io/MLDatasets.jl/stable
MIT License

Add ImageNet #146

Open adrhill opened 2 years ago

adrhill commented 2 years ago

Draft PR to add the ImageNet 2012 Classification Dataset (ILSVRC 2012-2017) as a ManualDataDep. Closes #100.


Since ImageNet is very large (>150 GB) and requires signing up and accepting the terms of access, it can only be added manually. The ManualDataDep instruction message for ImageNet includes the following:

When unpacked "PyTorch-style", the ImageNet dataset is assumed to look as follows: ImageNet -> split folder -> WordNet ID folder -> class samples as JPEG files, e.g.:

ImageNet
├── train
├── val
│   ├── n01440764
│   │   ├── ILSVRC2012_val_00000293.JPEG
│   │   ├── ILSVRC2012_val_00002138.JPEG
│   │   └── ...
│   ├── n01443537
│   └── ...
├── test
└── devkit
    ├── data
    │   ├── meta.mat
    │   └── ...
    └── ...
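
For context, this is roughly how a manual dependency is registered with DataDeps.jl; the instruction message below is abbreviated and is not the PR's exact text:

using DataDeps

register(ManualDataDep(
    "ImageNet",
    """
    Register at https://image-net.org/, download the ILSVRC 2012 archives,
    and unpack them into the folder layout shown above.
    """,
))

# Once the folder has been populated manually, datadep"ImageNet" resolves to its path.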

Current limitations

Since ImageNet is too large to precompute all preprocessed images and keep them in memory, the dataset precomputes a list of all file paths instead. Calling Base.getindex(d::ImageNet, i) loads the image via ImageMagick.jl and preprocesses it when required. This adds dependencies on ImageMagick and Images.jl via LazyModules.

This also means that the ImageNet struct currently doesn't contain features (which might be a requirement for SupervisedDatasets?)
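
For illustration, the lazy-indexing idea described above boils down to something like the following minimal sketch. It is not the PR's actual implementation, and FileIO.load stands in for the ImageMagick-based loader:

using FileIO

struct LazyImageFolder
    paths::Vector{String}   # every sample file, collected once up front
    preprocess::Function    # applied to an image only when it is accessed
end

Base.length(d::LazyImageFolder) = length(d.paths)

# decode and preprocess a single image on demand
Base.getindex(d::LazyImageFolder, i::Integer) = d.preprocess(FileIO.load(d.paths[i]))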

codecov-commenter commented 2 years ago

Codecov Report

Merging #146 (09d5be4) into master (86dabc4) will decrease coverage by 1.33%. The diff coverage is 6.75%.


@@            Coverage Diff             @@
##           master     #146      +/-   ##
==========================================
- Coverage   48.56%   47.23%   -1.33%     
==========================================
  Files          44       47       +3     
  Lines        2261     2335      +74     
==========================================
+ Hits         1098     1103       +5     
- Misses       1163     1232      +69     
Impacted Files                                           Coverage Δ
src/datasets/vision/imagenet_reader/preprocess.jl          0.00% <0.00%> (ø)
.../datasets/vision/imagenet_reader/ImageNetReader.jl      5.00% <5.00%> (ø)
src/datasets/vision/imagenet.jl                            7.31% <7.31%> (ø)
src/MLDatasets.jl                                        100.00% <100.00%> (ø)


lorenzoh commented 2 years ago

Will be good to have ImageNet support!

I'm wondering if there may be a simpler implementation for this, though. It seems the dataset has the same format as the (derived) ImageNette and ImageWoof datasets. The way those are loaded in FastAI.jl combines the MLUtils.jl primitives, and the same primitives could be used to load ImageNet as follows:

using MLDatasets, MLUtils, FileIO

function ImageNet(dir)
    # collect the paths of all JPEG files below `dir`
    files = FileDataset(identity, dir, "*.JPEG").paths
    return mapobs((FileIO.load, loadlabel), files)
end

# get the class (WordNet ID) from the file path; a lookup could be added here
# to convert the ID to the human-readable name
loadlabel(file::String) = split(file, "/")[end-1]

data  = ImageNet(IMAGENET_DIR)

# only training set
data  = ImageNet(joinpath(IMAGENET_DIR, "train"))

I'd also suggest using FileIO.jl for loading the images, which will use the faster JpegTurbo.jl under the hood.

If more control over the image loading is desired, like converting to a specific color type upon reading or decoding an image at a smaller size (much faster if it will be downsized during training anyway), one could also use JpegTurbo.jl directly:

using JpegTurbo, ImageCore  # ImageCore provides RGB, Gray, N0f8

function ImageNet(dir; C = RGB{N0f8}, preferred_size = nothing)
    files = FileDataset(identity, dir, "*.JPEG").paths
    return mapobs((f -> JpegTurbo.jpeg_decode(C, f; preferred_size), loadlabel), files)
end

# load as grayscale and smaller image size
data = ImageNet(IMAGENET_DIR; C = Gray{N0f8}, preferred_size = (224, 224))

adrhill commented 2 years ago

Thanks a lot, loading smaller images with JpegTurbo is indeed much faster! I've also added a lookup-table wnid_to_label to the metadata. Once you know the label, you can access class names and descriptions by indexing the corresponding metadata entries, e.g. metadata["class_names"][label].
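
For example, assuming the metadata is reachable as a dictionary on the dataset (the exact access path is an assumption; the key names are the ones mentioned above):

label = dataset.metadata["wnid_to_label"]["n01440764"]   # WordNet ID -> integer label
name  = dataset.metadata["class_names"][label]           # label -> human-readable class name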

adrhill commented 2 years ago

JpegTurbo's preferred_size keyword already returns images pretty close to the desired 224x224 size. At the cost of losing a couple of pixels, we could skip the second resizing in resize_smallest_dimension, which allocates, and instead directly center_crop, which is just a view.
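
For reference, a center crop that returns a view could look roughly like this. It is a sketch, not the PR's center_crop, and assumes the decoded image is at least the target size along each spatial dimension:

# allocation-free center crop: the result is a view into `img`, nothing is copied
function center_crop_view(img::AbstractMatrix, out::Tuple{Int,Int} = (224, 224))
    h, w = size(img)
    oh, ow = out
    top, left = (h - oh) ÷ 2 + 1, (w - ow) ÷ 2 + 1
    return @view img[top:(top + oh - 1), left:(left + ow - 1)]
end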

adrhill commented 2 years ago

I've done some local benchmarks. Current commit cac14d2, with JpegTurbo loading smaller images:

julia> using MLDatasets

julia> dataset = ImageNet(Float32, :val);

julia> @benchmark dataset[1:16]
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  104.413 ms … 143.052 ms  ┊ GC (min … max):  7.28% … 18.57%
 Time  (median):     113.164 ms               ┊ GC (median):    10.80%
 Time  (mean ± σ):   115.515 ms ±   9.030 ms  ┊ GC (mean ± σ):  10.46% ±  3.68%

    ▃          █                                                 
  ▇▄█▄▁▁▄▄▇▇▇▄▄█▄▇▄▄▇▁▄▁▄▁▁▄▁▄▄▁▄▁▁▄▄▁▁▁▄▁▁▁▁▁▄▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▄ ▁
  104 ms           Histogram: frequency by time          143 ms <

 Memory estimate: 131.78 MiB, allocs estimate: 2050.

Without resize_smallest_dimension, using only center_crop:

julia> @benchmark dataset[1:16]
BenchmarkTools.Trial: 57 samples with 1 evaluation.
 Range (min … max):  80.594 ms … 103.226 ms  ┊ GC (min … max):  7.43% … 19.03%
 Time  (median):     86.954 ms               ┊ GC (median):     8.95%
 Time  (mean ± σ):   88.287 ms ±   5.683 ms  ┊ GC (mean ± σ):  10.90% ±  3.57%

    ▄ ▄ ▁▁ █▄  ▁▄   ▁    ▁  ▁▁ ▁   ▄   ▁ ▁   ▁           ▁      
  ▆▆█▆█▁██▁██▆▁██▆▁▆█▁▁▆▁█▁▆██▆█▁▁▁█▆▁▆█▁█▁▁▁█▁▁▁▆▁▁▁▁▁▁▁█▁▁▁▆ ▁
  80.6 ms         Histogram: frequency by time          101 ms <

 Memory estimate: 115.96 MiB, allocs estimate: 1826.

Additionally using StackedViews.jl for batching:

julia> @benchmark dataset[1:16]
BenchmarkTools.Trial: 95 samples with 1 evaluation.
 Range (min … max):  47.971 ms … 73.503 ms  ┊ GC (min … max): 0.00% … 8.68%
 Time  (median):     51.116 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.903 ms ±  4.922 ms  ┊ GC (mean ± σ):  4.69% ± 5.81%

  ▂ ▂▄█                                                        
  █▇█████▃▅▃▅▅▆▃█▇▇▅▅▅▁▅▁▁▃▃▆▆▁▁▁▁▁▅▃▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▃ ▁
  48 ms           Histogram: frequency by time          70 ms <

 Memory estimate: 38.43 MiB, allocs estimate: 1499.
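
For context, StackedView builds a single array out of several same-size arrays without copying them. A minimal sketch of the idea, using grayscale matrices as stand-ins and leaving out the layout details of the actual batching:

using StackedViews

a = rand(Float32, 224, 224)
b = rand(Float32, 224, 224)

sv = StackedView(a, b)   # lazily stacks along a new leading dimension; nothing is copied
size(sv)                 # (2, 224, 224)
sv[1, :, :] == a         # true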

adrhill commented 2 years ago

Thanks for the review @Dsantra92! I'm slightly busy due to the JuliaCon submission deadline on Monday, but I'll get back to this PR as soon as possible.

adrhill commented 2 years ago

The order of the classes in the metadata also still has to be fixed as it doesn't match https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt.

adrhill commented 2 years ago

Sorry for stalling this.

I guess the issue with this PR boils down to whether preprocessing functions belong to MLDatasets or to packages exporting pre-trained models. This question has already been raised in https://github.com/FluxML/Metalhead.jl/issues/117.

Since images in ImageNet have different dimensions, providing an ImageNet data loader without matching preprocessing functions would be somewhat useless, as it would not be able to load batches of data. And as discussed here in the context of JpegTurbo.jpeg_decode, a lot of performance would be left on the table if we loaded full-size images just to immediately resize them.

I took a look at how other deep learning frameworks deal with this: both torchvision and Keras Applications export preprocessing functions with their pre-trained models. MLDatasets' FileDataset pattern would work well if pre-trained model libraries exported a corresponding loadfn. One of the issues mentioned in https://github.com/FluxML/Metalhead.jl/issues/117 is import latency for extra dependencies such as DataAugmentation.jl. Maybe LazyModules.jl could help circumvent this problem.
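
A rough sketch of that division of labour. The model_loadfn below is hypothetical and only decodes the file; a real model package would also resize and normalize as its weights expect:

using MLDatasets, FileIO

# hypothetical loadfn that a pre-trained model package could export alongside its weights
model_loadfn(path) = FileIO.load(path)

data = FileDataset(model_loadfn, IMAGENET_DIR, "*.JPEG")
x = data[1]   # decoded (and, in the real case, model-specifically preprocessed) sample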

RomeoV commented 1 year ago

Hey everyone, this looks awesome. Is anyone still working on this? Otherwise I would suggest trying to merge this, even if it's not "perfect" with regards to extra dependencies or open questions about transformations.

adrhill commented 1 year ago

I'm still interested in working on this.

To get this merged, we could make the preprocess and inverse_preprocess functions part of the ImageNet struct and provide the current functions as defaults.

Edit: inverse_preprocess is now a field of the ImageNet struct, preprocess is the loadfn of the internal FileDataset.
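
For orientation, the resulting layout could look roughly like the sketch below. Only inverse_preprocess and the FileDataset-backed loadfn are taken from the comments above; the remaining fields and types are assumptions, not the merged implementation:

using MLDatasets: FileDataset

struct ImageNet{T<:FileDataset}
    split::Symbol                  # :train, :val or :test
    dataset::T                     # internal FileDataset; its loadfn does the preprocessing
    inverse_preprocess::Function   # maps a preprocessed array back to an image
    metadata::Dict{String,Any}     # class names, wnid_to_label lookup, ...
end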

CarloLucibello commented 1 year ago

this needs a rebase, otherwise looks mostly good

adrhill commented 8 months ago

In case someone is still interested in using this, I've published an unregistered repository containing this PR: https://github.com/adrhill/ImageNetDataset.jl

The most notable difference is that ImageNetDataset.jl contains some custom preprocessing pipelines that support convert2image and work out of the box with Metalhead.jl.