FluxML / FastAI.jl

Repository of best practices for deep learning in Julia, inspired by fastai
https://fluxml.ai/FastAI.jl
MIT License
585 stars, 52 forks

Dataset recipes #153

Open lorenzoh opened 2 years ago

lorenzoh commented 2 years ago

With #151, FastAI.jl is getting high-level interfaces for searching datasets (finddatasets) and loading datasets into task-specific data containers (loaddataset). There is also a new DatasetRecipe that encapsulates configuration for loading a data container and the block information from a path. These recipes can be registered with a dataset so that they can be found using the above high-level functions.

The fastai dataset collection comes with quite a lot of datasets, so only a few have recipes yet. This issue tracks the progress on adding recipes to all the datasets. Contributions of recipe types and recipe configs for datasets are welcome.

See src/datasets/recipes.jl for example recipe implementations and src/datasets/fastairegistry for how recipes are registered. listdatasources() gives you a list of all dataset sources and datasetpath(name) downloads them and returns the download folder.

Progress

Datasets that can be used for multiple tasks have those tasks listed below. Otherwise, a checked dataset means that at least one recipe is already implemented.

ToucheSir commented 2 years ago

Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.

Edit: ref. https://github.com/JuliaML/MLDatasets.jl/issues/73 as well.

darsnack commented 2 years ago

It might be worth also looking at DataSets.jl announced at JuliaCon.

lorenzoh commented 2 years ago

> Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.

At some point, all the dataset functionality should be merged down to MLDatasets.jl and MLDataPattern.jl.

The registry itself is pretty barebones; if you take away the functionality related to blocks, then you could replace it with a Dict{String, Vector{DatasetRecipe}} that maps each dataset name to a list of recipes.
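As a rough illustration of that Dict-based idea, here is a minimal sketch. It does not use FastAI.jl: the `DatasetRecipe` abstract type below is a stand-in, and `ExampleRecipe`, `RECIPES`, `registerrecipe!` and `findrecipes` are hypothetical names made up for this example.

```julia
# Stand-in for illustration only; NOT FastAI.jl's actual DatasetRecipe type.
abstract type DatasetRecipe end

# Hypothetical concrete recipe that just carries a description.
struct ExampleRecipe <: DatasetRecipe
    description::String
end

# The registry: dataset name => recipes registered for that dataset.
const RECIPES = Dict{String, Vector{DatasetRecipe}}()

# Register a recipe, creating the dataset's entry on first use.
function registerrecipe!(registry, name::AbstractString, recipe::DatasetRecipe)
    push!(get!(() -> DatasetRecipe[], registry, name), recipe)
    return registry
end

# Look up the recipes for a dataset; unknown names give an empty list.
findrecipes(registry, name::AbstractString) = get(registry, name, DatasetRecipe[])

registerrecipe!(RECIPES, "dbpedia_csv", ExampleRecipe("table classification"))
registerrecipe!(RECIPES, "dbpedia_csv", ExampleRecipe("text classification"))

for recipe in findrecipes(RECIPES, "dbpedia_csv")
    println(recipe.description)
end
```

Keeping the registry as a plain dictionary means it carries no block information, which is the part that would not transfer to a package without the block API.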

lorenzoh commented 2 years ago

At some point we'll have to think about iterable datasets, and at that point some rearchitecting around DataSets.jl could be useful. It should also not be too hard to add iterable support to DataLoaders.jl.

For now I want to provide a useful core of offline datasets here in FastAI.jl with this simple approach. Rearchitecting should probably flow into the efforts in MLDatasets.jl (or perhaps a DLDatasets.jl if everything will be deprecated anyway?). I'll give a larger reply in https://github.com/JuliaML/MLDatasets.jl/issues/73 later.

In any case, any recipe logic associated with the fastai datasets here should be easily relocatable later. 👍

lorenzoh commented 2 years ago

Some are being added in #163

Chandu-4444 commented 2 years ago

Hey, I'd like to work on this issue. Since it is labeled "good first issue", I believe I can help. Could you specify what still has to be done? I see the list above hasn't been updated.

lorenzoh commented 2 years ago

Hey! The list above is up to date. The easiest thing to get started with should be adding recipes for the CSV datasets and registering some TableDatasetRecipes.

Chandu-4444 commented 2 years ago

Next I want to add recipes for dbpedia_csv and ag_news_csv. Both are in CSV format, but the labels are stored in a separate file and the CSV files only contain the indices of those labels. In that case, I think it is better to replace the label indices with the actual labels in the recipe code itself and then wrap it with TableClassificationRecipe. Are there any ideas on how to do this?

lorenzoh commented 2 years ago

Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure

Chandu-4444 commented 2 years ago

fastai-dbpedia_csv/
└── dbpedia_csv
    ├── classes.txt
    ├── readme.txt
    ├── test.csv
    └── train.csv

This is the folder structure for both datasets (dbpedia_csv, ag_news_csv).

Chandu-4444 commented 2 years ago

> Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure

Is it necessary to make a new recipe for datasets that have folder structures similar to the one above? Or is it possible to tweak the existing ones to get the job done?

lorenzoh commented 2 years ago

I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings. I don't have the bandwidth to look into this in more detail currently, though.
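The label-conversion step could look roughly like the sketch below. It uses only Julia's standard library and leaves out the table loading itself. `readclasses` and `indicestolabels` are hypothetical helper names, the class names are placeholders, and the sketch assumes that classes.txt holds one class name per line and that the label indices in the CSV files are 1-based.

```julia
# Read class names, one per line, skipping blank lines.
# Assumption: this matches the layout of a file like classes.txt.
function readclasses(path::AbstractString)
    names = String.(strip.(readlines(path)))
    return filter(!isempty, names)
end

# Replace label indices with the matching class names.
# Assumption: indices are 1-based; an out-of-range index throws a BoundsError.
function indicestolabels(indices::AbstractVector{<:Integer},
                         classes::AbstractVector{<:AbstractString})
    return [classes[i] for i in indices]
end

# Demo with a temporary file and placeholder class names.
dir = mktempdir()
classesfile = joinpath(dir, "classes.txt")
write(classesfile, "ClassA\nClassB\nClassC\n")

classes = readclasses(classesfile)
labels = indicestolabels([3, 1, 2, 1], classes)
println(labels)
```

A wrapping recipe would presumably call something like this on the label column after the inner table recipe has loaded train.csv, so that downstream code only ever sees label strings.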

Chandu-4444 commented 2 years ago

> I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings.

I'll work on this.

arcAman07 commented 2 years ago

After the community meet, I explored FastAI, MLUtils and a couple of other libraries and tried to understand the codebase. I would love to get started with adding a dataset. Could you specify which one of the above would be a good one to start with? Also, I believe the list above isn't updated.