dotnet / machinelearning-samples

Samples for ML.NET, an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
4.47k stars 2.68k forks

[AutoML API Samples] Move Datasets used by each sample to its own sample's folder #409

Open CESARDELATORRE opened 5 years ago

CESARDELATORRE commented 5 years ago

@cartacioS @daholste (For next week, after //BUILD/, not urgent. To be done with a PR against MASTER).

Since most datasets used by the samples are small, we're going with the approach of placing each dataset within its sample's folder, so each sample is more "autonomous". Sure, sometimes there will be redundancy, but a sample can then change its dataset, headers, etc. without impacting other samples.

Current AutoML samples have the datasets in a central folder in the root.

Please, when possible, let's move away from that approach and put each dataset in its own sample's folder.
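
For illustration, a minimal sketch of that move, written in Python for brevity. The folder names, the `Data` subfolder, and the sample-to-dataset mapping are all hypothetical, not the repo's actual layout. It copies rather than moves, so the central folder can be deleted once every sample has been verified.

```python
import shutil
from pathlib import Path


def copy_datasets_into_samples(central_dir, samples_root, mapping, data_subdir="Data"):
    """Copy each dataset a sample uses into that sample's own folder.

    mapping is {sample folder name: [dataset file names]}; it is a
    hypothetical structure that would be filled in by hand per sample.
    Returns the list of destination paths that were written.
    """
    central_dir, samples_root = Path(central_dir), Path(samples_root)
    written = []
    for sample, files in mapping.items():
        target = samples_root / sample / data_subdir
        target.mkdir(parents=True, exist_ok=True)
        for name in files:
            dest = target / name
            # copy2 preserves timestamps; the central copy is left untouched
            shutil.copy2(central_dir / name, dest)
            written.append(dest)
    return written
```

Each sample's project would then need its dataset path updated to point at its local copy.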

justinormont commented 5 years ago

We should start a discussion on the longer term organization of the repo.

The upside of each sample having its own dataset is the self-contained nature, where all parts of a sample are completely independent from the rest of the repo; for instance, we may want to munge a dataset to better show off the feature being demonstrated. The largest downsides I see are the larger space required for the duplicate datasets, and the lack of a central location to find datasets for new examples.
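
To put a number on the duplication cost, something like the following sketch could be run over a checkout. It groups files by content hash and reports the bytes taken up by extra copies. The file patterns are an assumption about which extensions count as datasets, and it reads each file fully into memory, which is fine for small datasets but would need chunked hashing for large ones.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicated_bytes(root, patterns=("*.csv", "*.tsv", "*.txt")):
    """Return (bytes spent on duplicate copies, {digest: [paths]}) under root."""
    by_digest = defaultdict(list)
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                by_digest[digest].append(path)
    # Keep only content that appears more than once.
    duplicates = {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
    # Every copy beyond the first counts as wasted space.
    wasted = sum(p[0].stat().st_size * (len(p) - 1) for p in duplicates.values())
    return wasted, duplicates
```

A dataset that a sample has edited (changed headers, for example) hashes differently and is deliberately not counted as a duplicate.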

Currently, it takes me 3min to clone the samples repo, and it creates a 1.8GB folder.

We may want to go further and move the datasets out of the repo entirely. As a comparison point, TensorFlow has a repo of datasets, exposed to code as tf.data.Dataset objects; I assume the datasets are dynamically grabbed from a CDN, as their repo takes 20s to clone and creates a 130MB folder. Dataset list: https://www.tensorflow.org/datasets/datasets
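
As a rough sketch of what "grabbed dynamically" could look like, here is a download-once-and-cache helper in Python. This is not how TensorFlow Datasets is implemented, and the function name, cache layout, and any URL passed in are hypothetical; it only illustrates the idea of keeping datasets out of the clone.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_dataset(url, cache_dir, filename=None):
    """Download url into cache_dir on first use and return the local path."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    dest = cache_dir / (filename or url.rsplit("/", 1)[-1])
    if dest.exists():
        return dest  # already cached, no network access needed
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)  # rename only after the download completed
    return dest
```

A sample would call this once at startup and read from the returned path. The trade-off is that samples no longer run offline on first use, and a real version would also want a checksum check.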