dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Find a new regression dataset #938

Open artidoro opened 5 years ago

artidoro commented 5 years ago

Some regression tests rely on a machine-generated regression dataset (Gaussian noise on top of a linear function of a vector input). The file was introduced by #937.
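
To make the description concrete, a dataset of this kind could be generated roughly as follows. This is only a sketch and not the generator used in #937: the function name `make_synthetic_regression`, the row and feature counts, the weight range, and the noise scale are all assumptions chosen for illustration.

```python
import random


def make_synthetic_regression(n_rows=100, n_features=5, noise_std=0.1, seed=0):
    """Return (rows, labels, weights) with label = w . x + Gaussian noise.

    All defaults are hypothetical; they are not taken from the ML.NET test data.
    """
    rng = random.Random(seed)  # seeded so the data is reproducible
    # Fixed linear function: one weight per input feature.
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
    rows, labels = [], []
    for _ in range(n_rows):
        # Vector input drawn from a standard normal distribution.
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # Linear function of the input plus Gaussian noise.
        y = sum(w * xi for w, xi in zip(weights, x)) + rng.gauss(0.0, noise_std)
        rows.append(x)
        labels.append(y)
    return rows, labels, weights


rows, labels, weights = make_synthetic_regression()
```

Because the label is linear in the features by construction, a linear regressor fits such data almost perfectly, which is one reason a real dataset is a more representative test.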

We should replace this dataset with a real dataset. Justin (@justinormont) suggested finding something from data.gov, for example predicting SF employee pay: https://catalog.data.gov/dataset/employee-compensation-53987

wschin commented 5 years ago

The LIBSVM datasets are also commonly used in research.

rogancarr commented 5 years ago

We have the following datasets that can be used for regression:

The following can be reformulated as regression problems:

codemzs commented 5 years ago

Rogan seems to have answered this question.

justinormont commented 5 years ago

The work item is to replace the synthetic datasets with ones more representative of user datasets. Rogan has pointed out great ones we can use as replacements in our tests.

codemzs commented 5 years ago

@justinormont The ones that Rogan pointed out are real datasets; the breast-cancer dataset is from 1992.