dominikmn / one-million-posts

Assisting newspaper moderators with machine learning.
MIT License

CrossValidation results suffering from oversampling/augmentation #97

Open dominikmn opened 3 years ago

dominikmn commented 3 years ago

Problem

In our current GridSearch approach, we train the models on the oversampled/augmented train set and run the cross-validation on that same set. This is a problem because duplicated or augmented copies of a sample can end up in both the train split and the validation split of a fold, so the model is validated on samples it has effectively already seen. Hence, the GridSearch will favor models that overfit.

Resources

https://imbalanced-learn.org/dev/miscellaneous.html#custom-samplers
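
A minimal sketch of one way around the problem described above: split the original samples into folds first, and oversample only the train part of each fold. This is not our current code. The helper names (`kfold_indices`, `oversample`, `leak_free_folds`, `class_counts`) and the toy labels are hypothetical, and plain random duplication stands in for our actual oversampling/augmentation.

```python
import random
from collections import Counter


def kfold_indices(n_samples, n_splits, seed=0):
    """Shuffle the ORIGINAL sample indices and yield (train, val) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def oversample(indices, labels, seed=0):
    """Randomly duplicate indices of smaller classes until all classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw the missing number of samples with replacement (k may be 0).
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def leak_free_folds(labels, n_splits=5, seed=0):
    """Split first, then oversample only the train part of each fold."""
    for train, val in kfold_indices(len(labels), n_splits, seed):
        yield oversample(train, labels, seed), val


def class_counts(indices, labels):
    """Count how often each class occurs among the given indices."""
    return Counter(labels[i] for i in indices)


# Hypothetical toy data: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
folds = list(leak_free_folds(labels, n_splits=5))
```

Each validation split then contains only original, non-duplicated samples that never appear in the matching train split. With imbalanced-learn, the same effect can probably be had by putting the sampler into an `imblearn.pipeline.Pipeline` and passing that pipeline to the GridSearch, since samplers in such a pipeline are applied during fitting only.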