arendakessian / spring2020-ml-project

fake review detection system

Results submission #13

Closed guidopetri closed 4 years ago

kelseymarkey commented 4 years ago

UPDATED 5/15 (bold = to include in table)

Multinomial Naive Bayes (downsampled, predict_proba):

-----  count  -----
{'alpha': 0.2733556118022408}
Dev   AUC:        0.7248
Dev   AP:         0.2031
Dev   Precision:  0.1930
Dev   Recall:     0.6143
**-----  tfidf  -----
{'alpha': 0.17693089816649998}
Dev   AUC:        0.7499
Dev   AP:         0.2294
Dev   Precision:  0.1920
Dev   Recall:     0.7297**
-----  binary  -----
{'alpha': 0.2733556118022408}
Dev   AUC:        0.7250
Dev   AP:         0.2032
Dev   Precision:  0.1905
Dev   Recall:     0.6225

Bernoulli Naive Bayes (downsampled, predict_proba): NOTE: All vectorizers performed exactly the same with Bernoulli NB. I think this is because Bernoulli NB only takes binary input, so it binarizes all vectors before modeling (BernoulliNB's binarize threshold defaults to 0.0, so any nonzero count or tfidf weight becomes 1, which gives the same matrix for all three vectorizers). Apparently it is standard to use a CountVectorizer with binary=True for Bernoulli NB, so I would report only that one in the final paper.

-----  count  -----
{'alpha': 0.0007614091062416327}
Dev   AUC:        0.6734
Dev   AP:         0.1608
Dev   Precision:  0.1442
Dev   Recall:     0.7733
-----  tfidf  -----
{'alpha': 0.0007614091062416327}
Dev   AUC:        0.6734
Dev   AP:         0.1608
Dev   Precision:  0.1442
Dev   Recall:     0.7733
**-----  binary  -----
{'alpha': 0.0007614091062416327}
Dev   AUC:        0.6734
Dev   AP:         0.1608
Dev   Precision:  0.1442
Dev   Recall:     0.7733**
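
The identical Bernoulli NB numbers can be reproduced on a toy corpus. This is a minimal sketch, not the project's pipeline: the documents, labels, and alpha value are made up for illustration, and it only assumes scikit-learn's default binarize=0.0 behaviour.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy reviews and labels (1 = fake); purely illustrative, not project data.
docs = [
    "great great food and great service",
    "terrible food terrible service",
    "best place ever best best best",
    "food was fine service was slow",
    "amazing amazing amazing deal",
    "slow service but fine food",
]
labels = np.array([1, 0, 1, 0, 1, 0])

vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "binary": CountVectorizer(binary=True),
}

probas = {}
for name, vectorizer in vectorizers.items():
    X = vectorizer.fit_transform(docs)
    # binarize=0.0 is the default: every value > 0 is mapped to 1 before fitting,
    # so counts and tfidf weights are reduced to presence/absence.
    model = BernoulliNB(alpha=0.1, binarize=0.0).fit(X, labels)
    probas[name] = model.predict_proba(X)[:, 1]

# Compare each vectorizer's predicted probabilities against the count version.
same_as_count = {
    name: bool(np.allclose(p, probas["count"])) for name, p in probas.items()
}
print(same_as_count)
```

Because the three matrices share the same nonzero pattern, the fitted models and their probabilities coincide, which is consistent with the matching AUC/AP/precision/recall rows above.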

Complement Naive Bayes (entire dataset, predict_proba): (This was done since it is "particularly suited for imbalanced data sets".)

-----  count  -----
{'alpha': 0.2733556118022408}
Dev   AUC:        0.7008
Dev   AP:         0.2143
Dev   Precision:  0.4737
Dev   Recall:     0.0148
-----  tfidf  -----
{'alpha': 0.2733556118022408}
Dev   AUC:        0.7182
Dev   AP:         0.2085
Dev   Precision:  0.0000
Dev   Recall:     0.0000
-----  binary  -----
{'alpha': 0.2733556118022408}
Dev   AUC:        0.7016
Dev   AP:         0.2142
Dev   Precision:  0.4673
Dev   Recall:     0.0137

I did some Naive Bayes upsampling vs downsampling investigations that I'll report separately from this.

arendakessian commented 4 years ago

LINEAR PERCEPTRON

with decision_function (Perceptron has no predict_proba; the raw scores rank examples the same way, which is all AUC and AP need), no penalty, as chosen by RandomSearch

-----count------------
Val AUC 0.7060673349928509
Val AP 0.17427758321442435

--------tfidf-----------
Val AUC 0.7394226525165951
Val AP 0.2157023479250944

---------binary---------
Val AUC 0.6496913251471955
Val AP 0.14967082209990168
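
Scoring the perceptron with decision_function works for AUC and AP because both metrics depend only on how the scores rank the examples. The sketch below illustrates this on synthetic data (make_classification stands in for the vectorized reviews, and all sizes and seeds are arbitrary assumptions): rescaling the raw margins to [0, 1] leaves both metrics unchanged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the review features (not project data).
X, y = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = Perceptron(penalty=None, max_iter=1000, random_state=0).fit(X_train, y_train)

# Signed distances to the hyperplane; these are not probabilities.
scores = clf.decision_function(X_val)

# Min-max rescale to [0, 1]: an order-preserving transform of the scores.
rescaled = (scores - scores.min()) / (scores.max() - scores.min())

auc_raw = roc_auc_score(y_val, scores)
auc_rescaled = roc_auc_score(y_val, rescaled)
ap_raw = average_precision_score(y_val, scores)
ap_rescaled = average_precision_score(y_val, rescaled)

print(f"Val AUC raw={auc_raw:.4f} rescaled={auc_rescaled:.4f}")
print(f"Val AP  raw={ap_raw:.4f} rescaled={ap_rescaled:.4f}")
```

Precision and recall at a fixed cutoff are a different matter, since they depend on where the threshold sits on the score scale.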

LINEAR SVM Regularized (C=1.0), Squared Hinge Loss as Determined by RandomSearch

-------------count-------------
Val AUC 0.7507259837160689
Val AP 0.2289730118025493

--------tfidf--------------------
Val AUC 0.7522518462302721
Val AP 0.23281083589105311

--------binary----------------
Val AUC 0.7493476395367485
Val AP 0.23021511582458373

TFIDF did not converge (with iter=1000).

FOOTNOTE: non-linear SVM was too computationally expensive, but from what was observed it was already performing significantly worse than Linear SVM

guidopetri commented 4 years ago

Gboost:

Vectorizer: count
ROC: 0.6628645400105468
AP: 0.17506248036747557

Vectorizer: tfidf
ROC: 0.6651296421639784
AP: 0.16772558247399094

Vectorizer: binary
ROC: 0.66271509763427
AP: 0.17489571085934696

on validation.

LR:

Vectorizer: count
ROC:        0.7474
AP:         0.2214

Vectorizer: tfidf
ROC:        0.7398
AP:         0.2233

Vectorizer: binary
ROC:        0.7332
AP:         0.2095

These are on validation, with the train/dev data shuffled (shuffling dev probably doesn't matter).

guidopetri commented 4 years ago

More LR. I did all of these with ngram_range=(1, 2).

No mod:

Vectorizer: tfidf
ROC: 0.7496553375031941
AP: 0.2292586687575256

Only last 4 features (the ones we created):

Vectorizer: tfidf
ROC: 0.6990534098600623
AP: 0.18295645913095684

All except last 4:

Vectorizer: tfidf
ROC: 0.6932794380881706
AP: 0.1917229670757335

With MaxAbsScaler (aka MinMaxScaler but for sparse):

Vectorizer: tfidf
ROC: 0.7220654036460457
AP: 0.21154852506869354
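
For context on why MaxAbsScaler is the sparse-friendly choice: it divides each column by its largest absolute value, so zeros stay zero and the matrix stays sparse. A small sketch on a hand-made matrix (the values are arbitrary, not project features):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Tiny sparse matrix standing in for vectorized text features.
X = sparse.csr_matrix(np.array([
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 1.0, 5.0],
]))

# Each column is divided by its max absolute value (4, 2, 5 here).
scaled = MaxAbsScaler().fit_transform(X)
print(scaled.toarray())

# MinMaxScaler shifts by the column minimum, which would destroy sparsity;
# scikit-learn rejects sparse input for it.
try:
    MinMaxScaler().fit_transform(X)
    minmax_accepts_sparse = True
except TypeError:
    minmax_accepts_sparse = False
print("MinMaxScaler accepts sparse input:", minmax_accepts_sparse)
```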

With TruncatedSVD (aka PCA but for sparse), 100 components:

Explained variance: 0.9828133014622754
Vectorizer: tfidf
ROC: 0.7428906075859388
AP: 0.22219139784666178

This failed to converge so it might be off.
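
A sketch of how a TruncatedSVD step and an explained-variance figure like the one above might be produced. The toy corpus and the choice of 3 components are assumptions for illustration; the thread does not show the actual code.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews; purely illustrative, not project data.
docs = [
    "great food great service",
    "great food terrible service",
    "terrible food",
    "great place",
    "food was great",
    "service was slow",
]

# Same ngram_range as the LR runs in this comment.
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# TruncatedSVD works directly on the sparse matrix (no centering step).
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)

# Sum of per-component ratios: the share of total variance that is retained.
explained = svd.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape)
print(f"Explained variance: {explained:.4f}")
```

The reduced matrix is dense, so it can be passed to LogisticRegression like any other feature array.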

With TruncatedSVD, 10 components:

Explained variance: 0.9819485010960559
Vectorizer: tfidf
ROC: 0.7173738092180015
AP: 0.20282451460762962

With L1 penalty:

Vectorizer: tfidf
ROC: 0.7496553375031941
AP: 0.2292586687575256

Limiting to only ngrams that show up in more than x reviews, where x is the mean:

Vectorizer: tfidf
ROC: 0.75481085951049
AP: 0.23154535530707157

This also failed to converge.

With x as 30, like in HW1:

Vectorizer: tfidf
ROC: 0.7546503613290275
AP: 0.23150355562945776

Also failed to converge.
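
One way the "more than x documents" filter could be expressed is through the vectorizer's min_df parameter. This is a sketch under assumptions: the thread does not say how the filter was implemented, "the mean" is taken here to be the mean document frequency over all ngrams, and the corpus is a toy one.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy reviews; purely illustrative, not project data.
docs = [
    "great food great service",
    "great food terrible service",
    "terrible food",
    "great place",
    "food was great",
    "service was slow",
]

# Document frequency of every ngram: binary counts summed over documents.
presence = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(docs)
doc_freq = np.asarray(presence.sum(axis=0)).ravel()
mean_df = doc_freq.mean()

# min_df is inclusive, so "more than x" becomes the smallest integer above x.
min_df = int(np.floor(mean_df)) + 1

full = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
limited = TfidfVectorizer(ngram_range=(1, 2), min_df=min_df).fit(docs)

print(f"mean document frequency: {mean_df:.2f}, min_df: {min_df}")
print(f"vocabulary size: {len(full.vocabulary_)} -> {len(limited.vocabulary_)}")
```

For the fixed cutoff of 30, the same idea reduces to passing min_df=31 (or min_df=30 if the intended rule is "at least 30").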

With scaling and PCA for 100 components:

Explained variance: 0.021099765181157443
Vectorizer: tfidf
ROC: 0.72265629247332
AP: 0.20918171284923562

Only last 4 features, with scaling:

Vectorizer: tfidf
ROC: 0.6982969940102426
AP: 0.18271077354771023

All except last 4, with scaling:

Vectorizer: tfidf
ROC: 0.7078180215315948
AP: 0.2012219303121121

Looking at these, limiting the feature space (either through L1 regularization or artificially by requiring an ngram to show up in more than x documents) seems to be the best choice here.

guidopetri commented 4 years ago

RF:

TFIDF:

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=550, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=128,
n_jobs=None, oob_score=False, random_state=22, verbose=0,
warm_start=False)
Train AUC: 0.9970
Train AP: 0.9969
Train Precision: 0.8135
Train Recall: 0.9999
Dev AUC: 0.6885
Dev AP: 0.1857
Dev Precision: 0.1393
Dev Recall: 0.8358

Count:

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=206, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=10,
min_weight_fraction_leaf=0.0, n_estimators=150,
n_jobs=None, oob_score=False, random_state=22, verbose=0,
warm_start=False)
Train AUC: 0.9302
Train AP: 0.9232
Train Precision: 0.7098
Train Recall: 0.9934
Dev AUC: 0.6782
Dev AP: 0.1788
Dev Precision: 0.1366
Dev Recall: 0.8273

Binary:

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=206, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=10,
min_weight_fraction_leaf=0.0, n_estimators=150,
n_jobs=None, oob_score=False, random_state=22, verbose=0,
warm_start=False)
Train AUC: 0.9299
Train AP: 0.9226
Train Precision: 0.7125
Train Recall: 0.9931
Dev AUC: 0.6766
Dev AP: 0.1827
Dev Precision: 0.1363
Dev Recall: 0.8213

guidopetri commented 4 years ago

More SVM:

no mods
Val AUC 0.7514917776749358
Val AP 0.2317726239381011
-----------------------------
last4
Val AUC 0.6941519462634351
Val AP 0.18116968617844315
-----------------------------
all but last 4
Val AUC 0.7125164711534802
Val AP 0.2053572619567911
-----------------------------
scaling
Val AUC 0.7334506913637129
Val AP 0.21935211298471108
-----------------------------
truncatedSVD100
Val AUC 0.7402465754611585
Val AP 0.2219357880331605
-----------------------------
truncatedSVD10
Val AUC 0.714849059164995
Val AP 0.20162762249021143
-----------------------------
with L1
Val AUC 0.7204560173481427
Val AP 0.20606592195047516
-----------------------------
ngram_mean
Val AUC 0.7512333615016391
Val AP 0.2316057835286238
-----------------------------
ngram_30
Val AUC 0.7503096814704876
Val AP 0.23059991152936476
-----------------------------
scaling + svd100
Val AUC 0.6917416193343989
Val AP 0.19332778082469648
-----------------------------
last 4 + scaling
Val AUC 0.6903680151775861
Val AP 0.176578647251165

More MNB:

no mods
VAL AUC 0.7498575657215163
VAL AP 0.2293512205357953
-----------------------------
last4
VAL AUC 0.6581026055173183
VAL AP 0.1437128131114489
-----------------------------
all but last 4
VAL AUC 0.7070023469057676
VAL AP 0.20591464071739765
-----------------------------
scaling
VAL AUC 0.6536068300836146
VAL AP 0.159505790364585
-----------------------------
truncatedSVD100
VAL AUC 0.7096800773626037
VAL AP 0.18745670852850765
-----------------------------
truncatedSVD10
VAL AUC 0.6878866303842578
VAL AP 0.15705489949022805
-----------------------------
ngram_mean
VAL AUC 0.7157689633180022
VAL AP 0.18875477473370467
-----------------------------
ngram_30
VAL AUC 0.6931802543914015
VAL AP 0.16490223532898143
-----------------------------
scaling + svd100
VAL AUC 0.67029364609327
VAL AP 0.1807745196643682
-----------------------------
last 4 + scaling
VAL AUC 0.6793984223370249
VAL AP 0.16104718364310758

Clearly the best are just the standard MNB/SVM. Was kind of hoping for some better results.

guidopetri commented 4 years ago

I think we're done here :)