Closed by michelole 5 years ago
Reducing the number of words to 200 decreases accuracy on test data from 80.95% to 79.25%.
If we then remove stopwords, accuracy drops to 76.48%.
I'll therefore keep Weka's default of 1000 tokens, which is not that much larger than the number of docs.
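For anyone who wants to repeat the comparison outside Weka, here is a rough sketch of the two knobs tested above: capping the vocabulary at the k most frequent words and optionally dropping stopwords first. It is not Weka's implementation, and the function names (`top_k_vocabulary`, `vectorize`), the toy documents, and the stopword set are all made up for illustration.

```python
from collections import Counter


def top_k_vocabulary(docs, k, stopwords=frozenset()):
    """Return the k most frequent tokens across all docs, skipping stopwords."""
    counts = Counter(
        token
        for doc in docs
        for token in doc.lower().split()
        if token not in stopwords
    )
    return [token for token, _ in counts.most_common(k)]


def vectorize(doc, vocabulary):
    """Turn one document into a word-count vector over the given vocabulary."""
    token_counts = Counter(doc.lower().split())
    return [token_counts[word] for word in vocabulary]


# Toy corpus (hypothetical); the real experiment used ~200 documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

vocab = top_k_vocabulary(docs, k=3, stopwords={"the", "on"})
matrix = [vectorize(doc, vocab) for doc in docs]

# Compare the number of feature dimensions with the number of documents.
print(f"{len(vocab)} dimensions for {len(matrix)} documents")
```

Varying `k` and the stopword set in a sketch like this corresponds to the 200-word and stopword-removal runs reported above; the accuracy numbers themselves depend on the classifier and data, which are not reproduced here.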
We have more dimensions (1000) than documents (~200). This is a basic ML mistake, so fix it.