dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Cross validation with stratified folds #4396

Open quantasm opened 4 years ago

quantasm commented 4 years ago

There does not appear to be a way to stratify data in ML.NET. Is this likely to be implemented anytime soon?

Say I have data with an uneven predictor field, split 90% / 10%. I would like to cross-validate the data with k folds so that each fold has an even predictor split of 50% / 50% (or any other desired split).

This does not seem possible yet but is a major feature that is required as part of ML modelling.

codemzs commented 4 years ago

@CESARDELATORRE FYI

CESARDELATORRE commented 4 years ago

@quantasm - This is related to an issue we have already identified and have in the backlog: https://github.com/dotnet/machinelearning/issues/4082

Adding @gvashishtha to follow up on this feature.

gvashishtha commented 4 years ago

@quantasm Can you explain more about your use case? The stratification features in scikit-learn both preserve the original distribution in the data, so you would get a 90%/10% split in your train and test datasets.

It seems that what you are after is balance, which would generally be achieved by up- or down-sampling. Is this true?
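
To make the distinction in this thread concrete, here is a minimal sketch in plain Python (standard library only). It is not ML.NET or scikit-learn code. The function names `stratified_folds` and `downsample_to_balance` are hypothetical and exist only for this illustration, and random down-sampling is assumed as the balancing method. The first function keeps the 90% / 10% ratio in every fold; the second drops majority-class rows until the classes are equal in size.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row index to one of k folds, preserving label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def downsample_to_balance(indices, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx in indices:
        by_label[labels[idx]].append(idx)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


labels = [0] * 90 + [1] * 10  # the 90% / 10% split from the question
folds = stratified_folds(labels, k=5)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]

# Balance only the training part of the first split;
# the held-out fold (folds[0]) keeps its original 90/10 ratio.
train = [i for fold in folds[1:] for i in fold]
balanced_train = downsample_to_balance(train, labels)
balanced_counts = Counter(labels[i] for i in balanced_train)
```

In this sketch only the training rows are balanced. Leaving the held-out fold at its original ratio is a design choice made here so that evaluation reflects the real class distribution; it is not something the thread prescribes.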