bst-mug / n2c2

Support code for participation at the 2018 n2c2 Shared-Task Track 1
https://n2c2.dbmi.hms.harvard.edu
Apache License 2.0

[SVM] Reduce number of dimensions #70

Closed by michelole 5 years ago

michelole commented 5 years ago

We have more dimensions (1000) than documents (~200). With more features than training documents the SVM is prone to overfitting, so we should reduce the number of dimensions.

michelole commented 5 years ago

Reducing the number of words to 200 decreases accuracy on test data from 80.95% to 79.25%.

If we then also remove stopwords, accuracy drops further to 76.48%.

I'll therefore keep Weka's default of 1000 tokens, which is not that much larger than the number of docs.
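
For reference, the experiment above (capping the vocabulary size, then also dropping stopwords) can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the project's Weka pipeline: the function names, the tiny stopword list, and the sample documents are all made up for the example, and Weka's own word selection may behave differently.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset({"the", "a", "of", "and", "is", "was", "with", "no"})


def build_vocabulary(docs, max_words, remove_stopwords=False):
    """Return the max_words most frequent tokens across all docs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        counts.update(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_words]]


def vectorize(doc, vocabulary):
    """Map a document to a vector of term counts over the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


# Made-up sample documents, not taken from the n2c2 data.
docs = [
    "the patient has a history of diabetes",
    "the patient was treated with aspirin",
    "no history of aspirin use",
]

vocab = build_vocabulary(docs, max_words=3)
vocab_no_stop = build_vocabulary(docs, max_words=3, remove_stopwords=True)
vectors = [vectorize(d, vocab) for d in docs]
```

Here `max_words` plays the role of the 200 vs. 1000 token limit discussed above: each document becomes a vector with exactly `max_words` dimensions, so lowering it shrinks the feature space the SVM sees.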