dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Cross-Validation experiment needs more parameters and info #1492

Closed: AleMiguelMicrosoft closed this issue 5 years ago

AleMiguelMicrosoft commented 5 years ago

When defining a cross-validation experiment, I'd like to be able to specify the proportion of records in the input set that are used for training vs. testing. A common practice is to specify the percentage of records that should go into the training set vs. the testing set, then randomly pick indexes from the data frame to form the training set based on that percentage, and use the disjoint remainder (all indexes not picked during the first sweep) for testing.
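A minimal sketch of that kind of percentage-based split, in plain C# (the 80/20 ratio, the 1,000-row dataset, and the fixed seed are illustrative assumptions, not an ML.NET API):

```csharp
// Plain C# sketch of the percentage-based split described above.
// The 80/20 ratio, the 1,000-row dataset, and the fixed seed are illustrative assumptions.
using System;
using System.Collections.Generic;
using System.Linq;

class PercentageSplitSketch
{
    static void Main()
    {
        List<int> rowIndexes = Enumerable.Range(0, 1000).ToList(); // stand-in for data frame row indexes
        double trainFraction = 0.8;                                 // assumed 80% training / 20% testing
        var rng = new Random(42);

        // Randomly pick indexes for the training set based on the requested percentage.
        List<int> shuffled = rowIndexes.OrderBy(_ => rng.Next()).ToList();
        int trainCount = (int)(rowIndexes.Count * trainFraction);
        List<int> trainSet = shuffled.Take(trainCount).ToList();

        // The disjoint remainder (all indexes not picked above) forms the testing set.
        List<int> testSet = shuffled.Skip(trainCount).ToList();

        Console.WriteLine($"Training rows: {trainSet.Count}, testing rows: {testSet.Count}");
    }
}
```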

In addition, the output of cross-validation should show the sizes of the training set and the fold sets after the run.

artidoro commented 5 years ago

Could you be more specific about where this functionality is missing? If possible, could you give a code example of what you are looking for?

artidoro commented 5 years ago

So k-fold cross-validation essentially splits the dataset into roughly k equal parts. It uses k-1 of them for training and one for testing, and iterates so that the part on which the algorithm is tested is always different.

The algorithm subdivides the dataset into k parts by assigning a uniformly generated number between 0 and 1 to each row. It then splits the dataset according to thresholds at multiples of 1/k on the randomly generated numbers. For example, if k = 3, we generate a random number R for every row and subdivide the dataset into three sections: the first contains all rows with R < 0.33, the second contains all rows with 0.33 < R < 0.66, and the remaining rows form the third.
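A minimal sketch of that assignment scheme, purely for illustration (this is not ML.NET's internal code; the fold count, row count, and seed are assumptions):

```csharp
// Illustration of the fold-assignment scheme described above (not ML.NET's internal code).
// Each row receives a uniform random number R in [0, 1) and falls into fold floor(R * k).
using System;

class FoldAssignmentSketch
{
    static void Main()
    {
        int k = 3;           // number of folds
        int rowCount = 9;    // illustrative dataset size
        var rng = new Random(1);

        for (int row = 0; row < rowCount; row++)
        {
            double r = rng.NextDouble();              // uniform number assigned to this row
            int fold = Math.Min((int)(r * k), k - 1); // thresholds at multiples of 1/k
            Console.WriteLine($"row {row}: R = {r:F2} -> fold {fold}");
        }
    }
}
```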

There is no built-in way of knowing exactly how many samples it ran on. If you are interested in training and testing on sections of the dataset of a specific size, I would recommend splitting your data into different files and simply using the standard train/test/score functionality in ML.NET.
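As a hedged sketch of that workaround using the later ML.NET 1.x-style API (which postdates this issue; the file name, column layout, and SDCA trainer are placeholder assumptions), you can either split with an explicit test fraction or cross-validate with a chosen number of folds:

```csharp
// Sketch using the ML.NET 1.x API. "data.csv", the column layout, and the SDCA
// regression trainer are illustrative assumptions.
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HouseData
{
    [LoadColumn(0)] public float Label;   // value to predict
    [LoadColumn(1)] public float Size;    // illustrative feature
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        IDataView data = mlContext.Data.LoadFromTextFile<HouseData>(
            "data.csv", hasHeader: true, separatorChar: ',');

        var pipeline = mlContext.Transforms.Concatenate("Features", nameof(HouseData.Size))
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Label"));

        // Option 1: explicit train/test proportions (here 80/20).
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2, seed: 1);
        var model = pipeline.Fit(split.TrainSet);
        var metrics = mlContext.Regression.Evaluate(model.Transform(split.TestSet));
        Console.WriteLine($"Hold-out RMSE: {metrics.RootMeanSquaredError:F3}");

        // Option 2: k-fold cross-validation; each fold holds roughly 1/k of the rows.
        var cvResults = mlContext.Regression.CrossValidate(
            data, pipeline, numberOfFolds: 3, labelColumnName: "Label", seed: 1);
        foreach (var result in cvResults)
            Console.WriteLine($"Fold {result.Fold}: RMSE = {result.Metrics.RootMeanSquaredError:F3}");
    }
}
```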

AleMiguelMicrosoft commented 5 years ago

@artidoro this makes sense. Please close the issue, but consider using your explanation in the documentation. Thank you.