KOLANICH opened this issue 5 years ago
AFAIK, David prepared several datasets for inclusion in sklearn.datasets, but the first pull request (https://github.com/scikit-learn/scikit-learn/pull/12459) was turned down by the sklearn maintainers. They said that we should upload the datasets to OpenML instead. Thus, he never bothered to submit the remaining PRs.
In addition, some dataset repositories, such as CRAN, do not store datasets in a standardized format, so they are not eligible to be added to sklearn.
> They said that we should upload the datasets to OpenML instead.
OpenML is a good project (though buggy; I have tried to upload some datasets there and failed, the server returns a very strange error), but there are other repositories of datasets. We need to do better. #10
I have implemented a fetcher for RDatasets, but it doesn't solve all the problems.
Your fetcher fetches from a curated list of datasets, standardized to a CSV format. The fetcher provided here fetches datasets from arbitrary R packages (if they are in the RData format). Neither CRAN nor R forces developers to follow a particular structure in their datasets. Thus, it is in general impossible to translate them to the sklearn format automatically.
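To make the problem concrete, here is a minimal sketch of reading an arbitrary .RData file, assuming the pyreadr package (the reader package and the file name are illustrative choices, not what either fetcher actually uses):

```python
# Minimal sketch: reading an arbitrary .RData file with pyreadr
# (the reader package and file name are assumptions for illustration).
import pyreadr

result = pyreadr.read_r("some_dataset.RData")  # hypothetical file

# `result` is an OrderedDict mapping R object names to pandas DataFrames.
# Nothing marks which column (if any) is the target, or how factors,
# matrices, and dates were encoded -- hence no general automatic
# translation to the (X, y) layout sklearn expects.
for name, df in result.items():
    print(name, df.shape, list(df.columns))
```

Whatever reader is used, the result is just a bag of named R objects; nothing identifies a target column or guarantees a tabular layout, which is why a general automatic translation is impossible.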
Your ideas sound intriguing. I did not know about https://frictionlessdata.io/specs/data-package/ and I think it is worth a look.
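For reference, the core of that spec is a plain descriptor file (datapackage.json) stored next to the data. A minimal sketch of such a descriptor, written here as a Python dict (the dataset name and fields are made up):

```python
# Minimal sketch of a Frictionless Data "Data Package" descriptor,
# written as a Python dict (it would normally live as datapackage.json
# next to the data; the dataset name and fields are made up).
import json

descriptor = {
    "name": "example-dataset",                    # hypothetical name
    "resources": [
        {
            "name": "data",
            "path": "data.csv",                   # the actual CSV file
            "schema": {
                "fields": [
                    {"name": "sepal_length", "type": "number"},
                    {"name": "species", "type": "string"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```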
I think moving this to scikit-learn-contrib as a separate project would be the best option.
However, as Carlos just pointed out, the sklearn maintainers were not willing to integrate our fetchers into sklearn and I doubt they will accept that we take control of the sklearn.datasets code here.
In my opinion, it makes sense that the toy dataset loaders (iris, boston, etc., basically the ones that are so small that they are included as files in the sklearn repo), the synthetic data generators (sklearn.datasets.samples_generator), and the datasets core (the Bunch class, some utility functions, etc.) are kept in sklearn. The small datasets are great to test new functionality in sklearn, are included in the repo, and do not define any validation partition, test partition, or CV folds.
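For context, this is the interface in question: the toy loaders ship their data inside the sklearn package and return a Bunch, with no download and no predefined partitions:

```python
# The existing sklearn toy-loader interface: data ships with the package,
# nothing is downloaded, and no validation/test partition is defined.
from sklearn.datasets import load_iris

iris = load_iris()                            # returns a Bunch
print(type(iris).__name__)                    # Bunch
print(iris.data.shape, iris.target.shape)     # (150, 4) (150,)
print(iris.feature_names)
```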
I think only the big dataset fetchers (openml, kddcup, etc.) should be moved to a separate dataset repo together with our fetchers. In the first place because they have to be downloaded "on the fly", but also because big datasets usually pack a predefined validation or test partition, and in some cases even predefined CV folds, which have to be treated carefully with CV generators if one wants to make the experiments reproducible (take a look at my raetsch, libsvm, or keel fetchers, for example).
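To make the reproducibility point concrete, here is a minimal sketch of honoring predefined folds with sklearn's PredefinedSplit (the data and fold array are synthetic stand-ins for what a real fetcher would ship):

```python
# Minimal sketch: honoring a dataset's predefined CV folds with a CV
# generator so experiments stay reproducible. The data and fold array
# are synthetic stand-ins for what a real fetcher would ship.
import numpy as np
from sklearn.model_selection import PredefinedSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# -1 = always in the training set; 0..2 = predefined validation fold.
test_fold = np.repeat([-1, 0, 1, 2], 25)

cv = PredefinedSplit(test_fold)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores)  # one score per predefined fold
```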
What do you think? Unfortunately, right now I lack the time to discuss this with the people in charge of sklearn or sklearn-contrib, but I think this would be a nice initiative.
> In my opinion, it makes sense that the toy dataset loaders (iris, boston, etc., basically the ones that are so small that they are included as files in the sklearn repo), the synthetic data generators (sklearn.datasets.samples_generator), and the datasets core (the Bunch class, some utility functions, etc.) are kept in sklearn.
I agree. I have a PR into sklearn which may be useful for this repo: https://github.com/scikit-learn/scikit-learn/pull/12721
> I think only the big dataset fetchers (openml, kddcup, etc.) should be moved to a separate dataset repo together with our fetchers.
I don't think so. I think we need to separate code from data (in a way that does not allow the data to be used for Turing-complete computations). The list of URIs and specs should belong to the data; only the data should be downloaded from the network on the fly, while the code should be static.
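A minimal sketch of that separation (the spec format, URL, and checksum are all illustrative): the fetcher is static and generic, and everything dataset-specific lives in a declarative, non-executable record:

```python
# Minimal sketch of separating static code from declarative data:
# the spec is pure data (it could itself be fetched as JSON) and cannot
# trigger any computation. The URL and checksum are made up.
import hashlib
import urllib.request

SPEC = {
    "name": "example",
    "url": "https://example.org/example.csv",  # hypothetical URL
    "sha256": "0" * 64,                        # hypothetical checksum
}

def fetch(spec, dest):
    """Generic, static fetcher: download and verify against the spec."""
    urllib.request.urlretrieve(spec["url"], dest)
    with open(dest, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != spec["sha256"]:
        raise ValueError(f"checksum mismatch for {spec['name']}")
    return dest
```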
Sorry for the late response, I'm quite busy lately.
I've been thinking about this and I agree with you. I like the idea of decoupling sklearn algorithms from the data. Moreover, this could be a nice opportunity to align with sklearn-pandas.
It is a bit inconvenient to use a ton of packages.
So, if you are serious about creating a repo of code for importing datasets, sklearn.datasets should be moved here and this repo should be transferred to the org, and then we should do a lot of work on creating a fully machine-readable package of datasets. If you are not serious, consider moving the code into sklearn.datasets.