arendakessian / spring2020-ml-project

fake review detection system

LogReg + small Gradient Boosting modeling #10

Closed guidopetri closed 4 years ago

guidopetri commented 4 years ago

As part of the modeling process, we want to look at Logistic Regression (since we saw this in class) and how it performs on our dataset. I'm also a little curious what gradient boosting would yield on the dataset, since I've personally always had a good experience with GBs.

LR won't have too many parameters to tune - there are only two classes, so no multiclass OvA/OvO machinery. I'm guessing it's mostly just controlling the regularization strength. This'll probably run super quickly, so I'll try a lot of values here.
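A minimal sketch of that regularization sweep, assuming scikit-learn's LogisticRegression (where C is the inverse regularization strength) - the synthetic data here is a stand-in for the real review features:

```python
# Sketch: sweep the regularization strength C for binary logistic regression.
# The synthetic imbalanced dataset is a placeholder for the real review data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_val)[:, 1]  # class-1 probabilities
    print(f"C={C:<7} ROC AUC={roc_auc_score(y_val, scores):.4f}")
```

Since each fit is cheap, a wide log-spaced grid like this costs almost nothing compared to the tree-based models.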

As for GB - we don't strictly need a GB model, but I'm going to explore it anyway (it also makes our workload a bit more even, haha). Boosting is sequential, so it'll probably take a long time to run, and like RFs it considers a lot of features - I imagine it'll be one of those models I turn on and check the results on three days later. Hopefully they're at least somewhat good. There are a lot of parameters to tweak here and I have no priors as to what might be good, so here's hoping.
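For reference, a small gradient boosting baseline might look like the sketch below. The parameter values are guesses for illustration, not tuned settings from this project, and the synthetic data again stands in for the real features:

```python
# Sketch: a small gradient boosting baseline scored by average precision.
# Hyperparameter values here are illustrative guesses, not tuned choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, subsample=0.8, random_state=0)
gb.fit(X_train, y_train)  # trees are built sequentially, hence the slow runs
ap = average_precision_score(y_val, gb.predict_proba(X_val)[:, 1])
print(f"AP: {ap:.4f}")
```

Because trees are fit one after another, wall-clock time grows roughly linearly with n_estimators - worth keeping in mind before launching a big grid.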

guidopetri commented 4 years ago

LR so far:

Vectorizer: count
ROC:        0.6825
AP:         0.1561

Vectorizer: tfidf
ROC:        0.6928
AP:         0.1650

Vectorizer: hashing
ROC:        0.5000
AP:         0.1016

Vectorizer: binary
ROC:        0.6850
AP:         0.1578

Vectorizer: hashing_binary
ROC:        0.5000
AP:         0.1016

:/

edit:

using predict_proba scores (instead of hard labels) and ignoring the hashing vectorizer types:

Vectorizer: count
ROC:        0.7492
AP:         0.2328
Precision:  0.1665
Recall:     0.8410

Vectorizer: tfidf
ROC:        0.7556
AP:         0.2353
Precision:  0.1831
Recall:     0.7780

Vectorizer: binary
ROC:        0.7466
AP:         0.2274
Precision:  0.1692
Recall:     0.8314
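A sketch of what that evaluation loop might look like, assuming scikit-learn vectorizers and metrics - the toy texts and labels are placeholders for the actual review data, and "binary" is assumed to mean a CountVectorizer with binary=True:

```python
# Sketch: compare vectorizers for LR, scoring with predict_proba
# probabilities rather than hard 0/1 predictions.
# Toy in-sample data; the real project would use held-out review data.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible fake review", "love it",
         "buy now scam", "works as described", "best best best deal"] * 20
labels = [0, 1, 0, 1, 0, 1] * 20

vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "binary": CountVectorizer(binary=True),  # presence/absence features
}

for name, vec in vectorizers.items():
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    pipe.fit(texts, labels)
    scores = pipe.predict_proba(texts)[:, 1]  # probabilities, not labels
    print(f"Vectorizer: {name}")
    print(f"ROC:        {roc_auc_score(labels, scores):.4f}")
    print(f"AP:         {average_precision_score(labels, scores):.4f}")
```

Scoring with probabilities matters here: ROC AUC and average precision are ranking metrics, so feeding them hard 0/1 predictions throws away the model's confidence ordering and can make results look much worse than they are.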