MichaelAquilina / SpamFilter

Classification of emails using machine learning and natural language processing techniques in Java

CV: should we care about equal distribution of spam vs ham in each fold? #15

Closed xhochy closed 10 years ago

MichaelAquilina commented 10 years ago

I think cross validation is meant to be random. Our training data is clearly biased towards ham, but if every email is equally likely to land in any fold (uniform random sampling), each fold should end up with roughly the same ham vs spam ratio as the full training set.
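
To make the comparison concrete, here is a minimal sketch of both options: plain random folds, and stratified folds that split spam and ham separately so the ratio is balanced by construction rather than only in expectation. The class `FoldSketch`, its method names, and the `boolean[] isSpam` label array are all hypothetical and are not taken from the SpamFilter code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of the SpamFilter code base.
class FoldSketch {

    // Creates k empty folds.
    static List<List<Integer>> emptyFolds(int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        return folds;
    }

    // Plain random k-fold: shuffle all indices, then deal them round-robin.
    // The spam/ham ratio per fold matches the data set only in expectation.
    static List<List<Integer>> randomFolds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = emptyFolds(k);
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    // Stratified k-fold: shuffle spam and ham separately and deal each class
    // round-robin, so per-fold counts of each class differ by at most one.
    static List<List<Integer>> stratifiedFolds(boolean[] isSpam, int k, long seed) {
        List<Integer> spam = new ArrayList<>();
        List<Integer> ham = new ArrayList<>();
        for (int i = 0; i < isSpam.length; i++) {
            (isSpam[i] ? spam : ham).add(i);
        }
        Random rng = new Random(seed);
        Collections.shuffle(spam, rng);
        Collections.shuffle(ham, rng);
        List<List<Integer>> folds = emptyFolds(k);
        int next = 0; // keeps running across classes so fold sizes stay even
        for (int idx : spam) {
            folds.get(next++ % k).add(idx);
        }
        for (int idx : ham) {
            folds.get(next++ % k).add(idx);
        }
        return folds;
    }

    // Counts how many indices in a fold are labelled spam.
    static int countSpam(List<Integer> fold, boolean[] isSpam) {
        int count = 0;
        for (int idx : fold) {
            if (isSpam[idx]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up labels: 20 spam out of 100 emails, purely for the demo.
        boolean[] isSpam = new boolean[100];
        for (int i = 0; i < 20; i++) {
            isSpam[i] = true;
        }
        for (List<Integer> fold : randomFolds(isSpam.length, 5, 42L)) {
            System.out.println("random:     size=" + fold.size()
                    + " spam=" + countSpam(fold, isSpam));
        }
        for (List<Integer> fold : stratifiedFolds(isSpam, 5, 42L)) {
            System.out.println("stratified: size=" + fold.size()
                    + " spam=" + countSpam(fold, isSpam));
        }
    }
}
```

With a large data set the two approaches should give similar ratios, so plain random folds are probably fine. Stratifying mainly helps when spam is rare enough that a purely random fold could end up with very few spam emails.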