247-ai / FlashML

FlashML from [24]7.ai: A library for automated model training on Apache Spark
Apache License 2.0

Implement stratified sampling for fold level splits #8

Open jithin247 opened 4 years ago

jithin247 commented 4 years ago

Currently, when we run CV on multiclass data, the fold-level splits are generated randomly, so the class distribution is not maintained across folds. This can result in wide variance in fold-level scores.

We need to implement stratified sample generation for folds.
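To make the idea concrete, here is a minimal sketch of stratified fold assignment in plain Python (standard library only, no Spark). It is not FlashML code: the function name `stratified_fold_ids` and its parameters are hypothetical and chosen only for illustration. Rows are grouped by class, shuffled within each class, and dealt round-robin into folds, so every fold gets nearly the same share of each class.

```python
import random
from collections import Counter, defaultdict


def stratified_fold_ids(labels, k, seed=0):
    """Assign each row a fold id in [0, k) while preserving class proportions.

    Hypothetical helper for illustration only; not part of FlashML or Spark.
    """
    if k < 2:
        raise ValueError("k must be at least 2")

    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_ids = [None] * len(labels)
    offset = 0  # carried across classes so leftover rows do not all land in fold 0
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize which rows of this class go to which fold
        for pos, idx in enumerate(indices):
            fold_ids[idx] = (offset + pos) % k  # deal rows round-robin into folds
        offset += len(indices)
    return fold_ids


# Example: an imbalanced three-class label column split into 3 folds.
labels = ["a"] * 9 + ["b"] * 6 + ["c"] * 3
folds = stratified_fold_ids(labels, k=3, seed=42)
for fold in range(3):
    print(fold, Counter(l for l, f in zip(labels, folds) if f == fold))
```

In a Spark setting, the same fold id could be attached to each row as a column and used to filter the training and validation sets per fold. Recent Spark releases expose a `foldCol` parameter on `CrossValidator` for user-specified folds, but whether FlashML's CV code path can use it is an assumption that would need to be checked.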

Note: Spark supports stratified sampling, but it does not apply it when generating the CV folds. Can you verify that this understanding is correct?

jithin247 commented 4 years ago

Check out this code: https://stackoverflow.com/questions/38610660/how-to-get-stratifiedkfold-in-scala-spark-mllib