Open lorenzoh opened 2 years ago
Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.
Edit: ref. https://github.com/JuliaML/MLDatasets.jl/issues/73 as well.
It might be worth also looking at DataSets.jl announced at JuliaCon.
> Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.
At some point, all the dataset functionality should be merged down to MLDatasets.jl and MLDataPattern.jl.
The registry itself is pretty barebones; if you take away the functionality related to blocks, then you could replace it with a Dict{String, Vector{DatasetRecipe}} that maps each dataset name to a list of recipes.
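To make the Dict{String, Vector{DatasetRecipe}} idea concrete, here is a minimal sketch. It is not the FastAI.jl implementation: the DatasetRecipe declared here is a local stand-in, and ToyRecipe, registerrecipe! and recipesfor are hypothetical names used only for illustration.

```julia
# Local stand-in for the recipe supertype (assumption: the real type lives in FastAI.jl).
abstract type DatasetRecipe end

# Hypothetical toy recipe that only carries a description.
struct ToyRecipe <: DatasetRecipe
    description::String
end

# Register a recipe under a dataset name, creating the entry on first use.
function registerrecipe!(registry::Dict{String, Vector{DatasetRecipe}},
                         name::AbstractString, recipe::DatasetRecipe)
    push!(get!(() -> DatasetRecipe[], registry, String(name)), recipe)
    return registry
end

# Look up all recipes for a dataset; unknown names give an empty list.
recipesfor(registry::Dict{String, Vector{DatasetRecipe}}, name::AbstractString) =
    get(registry, String(name), DatasetRecipe[])

# Dataset name => recipes registered for that dataset.
registry = Dict{String, Vector{DatasetRecipe}}()
registerrecipe!(registry, "dbpedia_csv", ToyRecipe("table classification"))
registerrecipe!(registry, "ag_news_csv", ToyRecipe("table classification"))
```

Anything block-related (for example, searching by block types) would have to sit on top of this, which is the part that does not translate directly to MLDatasets.jl.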
At some point we'll have to think about iterable datasets, and at that point some rearchitecting around DataSets.jl could be useful. It should also not be too hard to add iterable support to DataLoaders.jl.
For now I want to provide a useful core of offline datasets here in FastAI.jl with this simple approach. Rearchitecting should probably flow into the efforts in MLDatasets.jl (or perhaps a DLDatasets.jl if everything will be deprecated anyway?). I'll give a larger reply in https://github.com/JuliaML/MLDatasets.jl/issues/73 later
In any case, any recipe logic associated with the fastai datasets here should be easily relocatable later.
Some are being added in #163
Hey, I'd like to work on this issue. Since it is labeled good first issue, I believe I can help. Could you please specify what still has to be done? I see the list above hasn't been updated.
Hey! The list above is up to date. The easiest thing to get started with should be adding recipes for the CSV datasets and registering some TableDatasetRecipes.
Next I want to add recipes for dbpedia_csv and ag_news_csv. They are both in CSV format, but the labels are stored in a separate file, and the actual CSV files refer to them by index. In that case, I think it is better to replace the label indices with the actual labels in the recipe code itself and then wrap it with TableClassificationRecipe? Are there any ideas on how to do this?
Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure.
fastai-dbpedia_csv/
└── dbpedia_csv
    ├── classes.txt
    ├── readme.txt
    ├── test.csv
    └── train.csv
This is the folder structure for both datasets (dbpedia_csv, ag_news_csv).
> Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure
Is it necessary to make a new recipe for datasets that have folder structures similar to the one above? Or is it possible to tweak the existing ones to get the job done?
I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings. I don't have the bandwidth to look into this in more detail currently, though.
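The index-to-label step could be sketched roughly as follows. This is not a FastAI.jl recipe, only the conversion a wrapping recipe would need. It assumes classes.txt holds one class name per line and that the CSV label column holds 1-based indices into that list. readclasses and indextolabels are hypothetical helper names, and the class names written to the temporary file are toy values, not the real file contents.

```julia
# Read one class name per line (e.g. from classes.txt), skipping blank lines.
readclasses(path::AbstractString) =
    filter(!isempty, String.(strip.(readlines(path))))

# Replace label indices with the class names they refer to.
# Assumption: indices are 1-based and follow the line order of classes.txt.
function indextolabels(indices::AbstractVector{<:Integer},
                       classes::AbstractVector{<:AbstractString})
    all(i -> 1 <= i <= length(classes), indices) ||
        throw(ArgumentError("label index outside 1:$(length(classes))"))
    return [classes[i] for i in indices]
end

# Toy data written to a temporary folder so the sketch runs without a download.
classes = mktempdir() do dir
    path = joinpath(dir, "classes.txt")
    write(path, "ClassA\nClassB\nClassC\n")
    readclasses(path)
end

# Stand-in for the label column of train.csv.
labels = indextolabels([2, 1, 3, 2], classes)
```

A wrapping recipe would apply the same conversion to the label column of the table loaded by TableRecipe before handing it on.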
> I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings.
I'll work on this.
After the community meet, I explored FastAI.jl, MLUtils, and a couple of other libraries and tried to understand the codebase. I would love to get started with adding a dataset. Could you please specify which of the above would be a good one to start with? Also, I believe the list above isn't updated.
With #151, FastAI.jl is getting high-level interfaces for searching datasets (finddatasets) and loading datasets into task-specific data containers (loaddataset). There is also a new DatasetRecipe that encapsulates configuration for loading a data container and the block information from a path. These recipes can be registered with a dataset so that they can be found using the above high-level functions.
The fastai dataset collection comes with quite a lot of datasets, so only a few have recipes yet. This issue tracks the progress on adding recipes to all the datasets. Contributions of recipe types and recipe configs for datasets are welcome.
See src/datasets/recipes.jl for example recipe implementations and src/datasets/fastairegistry for how recipes are registered. listdatasources() gives you a list of all dataset sources, and datasetpath(name) downloads a dataset and returns the download folder.
Progress
For datasets that can be used for multiple tasks, the tasks are listed below the dataset. Otherwise, a checked dataset means that at least one recipe is already implemented.
(Image{2}, LabelMulti)
)