Currently, when we run CV on multiclass data, the fold-level splits are generated randomly, so the class distribution is not maintained across folds. This can result in wide variance in fold-level scores.
We need to implement stratified sample generation for folds.
Note: Spark supports stratified sampling, but it is not applied when generating folds. Can you check this to confirm that this understanding is correct?
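To make the request concrete, here is a minimal sketch of stratified fold assignment in plain Python. It is not Spark code and not the proposed implementation: the function name `stratified_fold_ids` and its parameters are hypothetical, and it assumes the labels fit in memory. The idea is to group rows by class, shuffle within each class, and deal rows to folds round-robin so that each fold receives a near-equal share of every class.

```python
import random
from collections import defaultdict


def stratified_fold_ids(labels, num_folds, seed=0):
    """Return a fold id per row (hypothetical helper, not a Spark API).

    Per-class counts differ by at most one between any two folds.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        rng.shuffle(indices)  # randomize which rows of this class go where
        for pos, idx in enumerate(indices):
            fold_of[idx] = (pos + offset) % num_folds
        offset += len(indices)
    return fold_of


if __name__ == "__main__":
    labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 3
    folds = stratified_fold_ids(labels, num_folds=3, seed=42)
    for f in range(3):
        members = [labels[i] for i, fid in enumerate(folds) if fid == f]
        print(f, sorted(members))
```

Carrying the offset from one class to the next keeps overall fold sizes balanced as well, which matters when many classes have fewer rows than there are folds. A distributed version would need a different mechanism for the per-class shuffle, so treat this only as a statement of the expected behavior.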