arendakessian / spring2020-ml-project

fake review detection system

LogReg + small Gradient Boosting modeling #10

Closed guidopetri closed 4 years ago

guidopetri commented 4 years ago

As part of the modeling process, we want to look at Logistic Regression (since we saw this in class) and how it performs on our dataset. I'm also a little curious what gradient boosting would yield on the dataset, since I've personally always had a good experience with GBs.

LR won't have too many parameters to tune - there are only two classes, so no multiclass OvA/OvO machinery. I'm guessing it's mostly just controlling the regularization strength. This'll probably run super quickly, so I'll try a lot of values here.
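A minimal sketch of that regularization sweep, assuming scikit-learn's LogisticRegression (where C is the inverse regularization strength) - the synthetic data here is a stand-in for the real review features:

```python
# Sketch: sweep the regularization strength C for binary logistic regression.
# The synthetic imbalanced dataset is a placeholder for the real review data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_val)[:, 1]  # class-1 probabilities
    print(f"C={C:<7} ROC AUC={roc_auc_score(y_val, scores):.4f}")
```

Since each fit is cheap, a wide log-spaced grid like this costs almost nothing compared to the tree-based models.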

As for GB - we don't strictly need a GB model, but I'm going to explore it anyway (it also makes our workload a bit more even, haha). Boosting is sequential, so it'll probably take a long time to run, and like RFs it considers a lot of features - I imagine it'll be one of those models I turn on and check the results on three days later. Hopefully they're at least somewhat good. There are a lot of parameters to tweak here and I have no priors as to what might be good, so here's hoping.
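For reference, a small gradient boosting baseline might look like the sketch below. The parameter values are guesses for illustration, not tuned settings from this project, and the synthetic data again stands in for the real features:

```python
# Sketch: a small gradient boosting baseline scored by average precision.
# Hyperparameter values here are illustrative guesses, not tuned choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, subsample=0.8, random_state=0)
gb.fit(X_train, y_train)  # trees are built sequentially, hence the slow runs
ap = average_precision_score(y_val, gb.predict_proba(X_val)[:, 1])
print(f"AP: {ap:.4f}")
```

Because trees are fit one after another, wall-clock time grows roughly linearly with n_estimators - worth keeping in mind before launching a big grid.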

guidopetri commented 4 years ago

LR so far:

Vectorizer: count
ROC:        0.6825
AP:         0.1561

Vectorizer: tfidf
ROC:        0.6928
AP:         0.1650

Vectorizer: hashing
ROC:        0.5000
AP:         0.1016

Vectorizer: binary
ROC:        0.6850
AP:         0.1578

Vectorizer: hashing_binary
ROC:        0.5000
AP:         0.1016

:/

edit:

using predict_proba scores (instead of hard labels) and ignoring the hashing vectorizer types:

Vectorizer: count
ROC:        0.7492
AP:         0.2328
Precision:  0.1665
Recall:     0.8410

Vectorizer: tfidf
ROC:        0.7556
AP:         0.2353
Precision:  0.1831
Recall:     0.7780

Vectorizer: binary
ROC:        0.7466
AP:         0.2274
Precision:  0.1692
Recall:     0.8314
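A sketch of what that evaluation loop might look like, assuming scikit-learn vectorizers and metrics - the toy texts and labels are placeholders for the actual review data, and "binary" is assumed to mean a CountVectorizer with binary=True:

```python
# Sketch: compare vectorizers for LR, scoring with predict_proba
# probabilities rather than hard 0/1 predictions.
# Toy in-sample data; the real project would use held-out review data.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible fake review", "love it",
         "buy now scam", "works as described", "best best best deal"] * 20
labels = [0, 1, 0, 1, 0, 1] * 20

vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "binary": CountVectorizer(binary=True),  # presence/absence features
}

for name, vec in vectorizers.items():
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    pipe.fit(texts, labels)
    scores = pipe.predict_proba(texts)[:, 1]  # probabilities, not labels
    print(f"Vectorizer: {name}")
    print(f"ROC:        {roc_auc_score(labels, scores):.4f}")
    print(f"AP:         {average_precision_score(labels, scores):.4f}")
```

Scoring with probabilities matters here: ROC AUC and average precision are ranking metrics, so feeding them hard 0/1 predictions throws away the model's confidence ordering and can make results look much worse than they are.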