LINEAR PERCEPTRON
scored with decision_function (Perceptron has no predict_proba; for ranking metrics like AUC/AP the margin scores serve the same role), no penalty as selected by RandomSearch
----- count -----
Val AUC 0.7060673349928509
Val AP 0.17427758321442435
----- tfidf -----
Val AUC 0.7394226525165951
Val AP 0.2157023479250944
----- binary -----
Val AUC 0.6496913251471955
Val AP 0.14967082209990168
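A minimal sketch of this setup, assuming scikit-learn's Perceptron; the texts and labels are toy placeholders, not the real email data:

```python
# Perceptron has no predict_proba; decision_function returns margin
# scores, which is all that ROC-AUC / AP need for ranking.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy stand-ins for the real train/validation emails.
train_texts = ["free money now", "meeting at noon", "win a prize today", "lunch tomorrow"]
train_y = [1, 0, 1, 0]
val_texts = ["free prize money", "see you at lunch"]
val_y = [1, 0]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
X_val = vec.transform(val_texts)

clf = Perceptron(penalty=None, random_state=22)  # no penalty, per the random search
clf.fit(X_train, train_y)

scores = clf.decision_function(X_val)  # real-valued margins, not probabilities
print("Val AUC", roc_auc_score(val_y, scores))
print("Val AP", average_precision_score(val_y, scores))
```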
LINEAR SVM, regularized (C=1.0), squared hinge loss, as determined by RandomSearch
----- count -----
Val AUC 0.7507259837160689
Val AP 0.2289730118025493
----- tfidf -----
Val AUC 0.7522518462302721
Val AP 0.23281083589105311
----- binary -----
Val AUC 0.7493476395367485
Val AP 0.23021511582458373
The tfidf run did not converge (with max_iter=1000).
FOOTNOTE: a non-linear SVM was too computationally expensive to run fully, but from what was observed it was already performing significantly worse than the linear SVM.
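A sketch of the linear SVM runs, assuming scikit-learn's LinearSVC (squared hinge and C=1.0 are its defaults); data is again a toy placeholder:

```python
# LinearSVC with its default squared hinge loss and C=1.0; like the
# perceptron, it is scored via decision_function margins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.svm import LinearSVC

train_texts = ["cheap pills online", "project update attached", "cheap offer now", "agenda for monday"]
train_y = [1, 0, 1, 0]
val_texts = ["cheap pills offer", "monday project agenda"]
val_y = [1, 0]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_val = vec.transform(val_texts)

svm = LinearSVC(C=1.0, loss="squared_hinge", max_iter=1000, random_state=22)
svm.fit(X_train, train_y)

scores = svm.decision_function(X_val)  # LinearSVC has no predict_proba either
print("Val AUC", roc_auc_score(val_y, scores))
print("Val AP", average_precision_score(val_y, scores))
```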
Gboost:
Vectorizer: count
ROC: 0.6628645400105468
AP: 0.17506248036747557
Vectorizer: tfidf
ROC: 0.6651296421639784
AP: 0.16772558247399094
Vectorizer: binary
ROC: 0.66271509763427
AP: 0.17489571085934696
on validation.
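These numbers could come from something like the following, assuming scikit-learn's GradientBoostingClassifier (the exact booster and its parameters aren't recorded above); the corpus is a toy placeholder:

```python
# Gradient boosting over a count-vectorized toy corpus; the real runs
# repeated this for the count, tfidf, and binary vectorizers.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import average_precision_score, roc_auc_score

train_texts = ["act now limited offer", "notes from class", "limited offer act fast", "class schedule posted"]
train_y = [1, 0, 1, 0]
val_texts = ["limited time offer", "schedule for class"]
val_y = [1, 0]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
X_val = vec.transform(val_texts)

gb = GradientBoostingClassifier(random_state=22)
gb.fit(X_train.toarray(), train_y)  # dense for simplicity at this toy scale

probs = gb.predict_proba(X_val.toarray())[:, 1]
print("ROC:", roc_auc_score(val_y, probs))
print("AP:", average_precision_score(val_y, probs))
```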
LR:
Vectorizer: count
ROC: 0.7474
AP: 0.2214
Vectorizer: tfidf
ROC: 0.7398
AP: 0.2233
Vectorizer: binary
ROC: 0.7332
AP: 0.2095
These were run with the train/dev data shuffled (I guess shuffling dev is probably not important), evaluated on validation.
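The LR numbers above could be produced by a loop like this; the texts, labels, and shuffle seed are toy placeholders:

```python
# Logistic regression over the three vectorizers, with the training
# data shuffled first.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.utils import shuffle

train_texts = ["win cash fast", "quarterly report inside", "fast cash win big", "team standup notes"]
train_y = [1, 0, 1, 0]
val_texts = ["win fast cash now", "notes from the standup"]
val_y = [1, 0]

train_texts, train_y = shuffle(train_texts, train_y, random_state=22)

vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "binary": CountVectorizer(binary=True),
}
for name, vec in vectorizers.items():
    X_train = vec.fit_transform(train_texts)
    X_val = vec.transform(val_texts)
    lr = LogisticRegression(max_iter=1000).fit(X_train, train_y)
    probs = lr.predict_proba(X_val)[:, 1]
    print(f"Vectorizer: {name}")
    print(f"ROC: {roc_auc_score(val_y, probs):.4f}")
    print(f"AP: {average_precision_score(val_y, probs):.4f}")
```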
More LR. I ran all of these with ngram_range=(1, 2).
No mod:
Vectorizer: tfidf
ROC: 0.7496553375031941
AP: 0.2292586687575256
Only last 4 features (the ones we created):
Vectorizer: tfidf
ROC: 0.6990534098600623
AP: 0.18295645913095684
All except last 4:
Vectorizer: tfidf
ROC: 0.6932794380881706
AP: 0.1917229670757335
With MaxAbsScaler (the sparse-friendly counterpart of MinMaxScaler; it scales each feature by its maximum absolute value without centering):
Vectorizer: tfidf
ROC: 0.7220654036460457
AP: 0.21154852506869354
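A sketch of the scaled run, with toy data; MaxAbsScaler only divides each column by its max absolute value, so the tf-idf matrix stays sparse:

```python
# MaxAbsScaler in a pipeline: no centering step, so sparsity is preserved.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

texts = ["free prize inside", "status update attached", "claim your free prize", "weekly status report"]
y = [1, 0, 1, 0]

pipe = make_pipeline(TfidfVectorizer(), MaxAbsScaler(), LogisticRegression(max_iter=1000))
pipe.fit(texts, y)

prob = pipe.predict_proba(["free prize"])[:, 1]
print(prob)
```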
With TruncatedSVD (a PCA-like factorization that works on sparse matrices, since it skips centering), 100 components:
Explained variance: 0.9828133014622754
Vectorizer: tfidf
ROC: 0.7428906075859388
AP: 0.22219139784666178
This failed to converge so it might be off.
With TruncatedSVD, 10 components:
Explained variance: 0.9819485010960559
Vectorizer: tfidf
ROC: 0.7173738092180015
AP: 0.20282451460762962
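A sketch of the SVD runs, with a toy corpus and 3 components standing in for 100:

```python
# TruncatedSVD factorizes the sparse tf-idf matrix directly, without
# the centering step PCA would require.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["lottery winner claim", "minutes from meeting", "claim lottery prize",
         "meeting minutes attached", "winner claim prize", "attached agenda"]
X = TfidfVectorizer().fit_transform(texts)

svd = TruncatedSVD(n_components=3, random_state=22)
X_reduced = svd.fit_transform(X)
print("shape:", X_reduced.shape)
print("explained variance:", svd.explained_variance_ratio_.sum())
```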
With L1 penalty:
Vectorizer: tfidf
ROC: 0.7496553375031941
AP: 0.2292586687575256
Limiting to only ngrams that show up in more than x emails, where x is the mean:
Vectorizer: tfidf
ROC: 0.75481085951049
AP: 0.23154535530707157
This also failed to converge.
With x as 30, like in HW1:
Vectorizer: tfidf
ROC: 0.7546503613290275
AP: 0.23150355562945776
Also failed to converge.
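The document-frequency cutoff maps onto TfidfVectorizer's min_df; a sketch with a toy corpus and min_df=2 standing in for 30:

```python
# min_df keeps only terms that appear in at least that many documents.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["spam spam offer", "offer details", "spam offer now", "rare token here"]

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
vec.fit(texts)
print(sorted(vec.vocabulary_))  # ngrams in fewer than 2 docs are dropped
```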
With scaling and PCA for 100 components:
Explained variance: 0.021099765181157443
Vectorizer: tfidf
ROC: 0.72265629247332
AP: 0.20918171284923562
Only last 4 features, with scaling:
Vectorizer: tfidf
ROC: 0.6982969940102426
AP: 0.18271077354771023
All except last 4, with scaling:
Vectorizer: tfidf
ROC: 0.7078180215315948
AP: 0.2012219303121121
Looking at these, I think it's clear that limiting the feature space (either through L1 regularization or artificially by saying "it has to show up in x documents") seems to be the best choice here.
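The L1 route can be sketched like this; with the liblinear solver, the penalty drives most ngram weights to exactly zero:

```python
# L1-regularized logistic regression zeroes out most ngram weights,
# i.e. feature selection via the penalty rather than via min_df.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["buy now cheap", "report attached", "cheap buy offer", "attached report draft"]
y = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
lr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

n_nonzero = int(np.count_nonzero(lr.coef_))
print(f"{n_nonzero} of {lr.coef_.size} weights are nonzero")
```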
RF:
TFIDF:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=550, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=128,
n_jobs=None, oob_score=False, random_state=22, verbose=0,
warm_start=False)
Train AUC: 0.9970
Train AP: 0.9969
Train Precision: 0.8135
Train Recall: 0.9999
Dev AUC: 0.6885
Dev AP: 0.1857
Dev Precision: 0.1393
Dev Recall: 0.8358
Count:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=206, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=10,
min_weight_fraction_leaf=0.0, n_estimators=150,
n_jobs=None, oob_score=False, random_state=22, verbose=0,
warm_start=False)
Train AUC: 0.9302
Train AP: 0.9232
Train Precision: 0.7098
Train Recall: 0.9934
Dev AUC: 0.6782
Dev AP: 0.1788
Dev Precision: 0.1366
Dev Recall: 0.8273
Binary:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=206, max_features='log2',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=10,
min_weight_fraction_leaf=0.0, n_estimators=150,
n_jobs=None, oob_score=False, random_state=22, verbose=0,
warm_start=False)
Train AUC: 0.9299
Train AP: 0.9226
Train Precision: 0.7125
Train Recall: 0.9931
Dev AUC: 0.6766
Dev AP: 0.1827
Dev Precision: 0.1363
Dev Recall: 0.8213
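The printed reprs above boil down to a handful of non-default arguments; a sketch of rebuilding the tf-idf forest (toy data for the fit):

```python
# Only the non-default arguments from the repr are needed; the rest
# are sklearn defaults.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

rf = RandomForestClassifier(
    n_estimators=128,
    max_depth=550,
    max_features="log2",
    min_samples_split=5,
    random_state=22,
)

# Toy fit to show usage; the near-perfect train AUC vs. ~0.69 dev AUC
# above suggests the forest is mostly memorizing the training set.
texts = ["unsubscribe now", "draft attached", "click to unsubscribe", "attached draft v2"]
y = [1, 0, 1, 0]
vec = TfidfVectorizer()
rf.fit(vec.fit_transform(texts), y)
prob = rf.predict_proba(vec.transform(["click here to unsubscribe"]))[:, 1]
print(prob)
```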
More SVM:
no mods
Val AUC 0.7514917776749358
Val AP 0.2317726239381011
-----------------------------
last4
Val AUC 0.6941519462634351
Val AP 0.18116968617844315
-----------------------------
all but last 4
Val AUC 0.7125164711534802
Val AP 0.2053572619567911
-----------------------------
scaling
Val AUC 0.7334506913637129
Val AP 0.21935211298471108
-----------------------------
truncatedSVD100
Val AUC 0.7402465754611585
Val AP 0.2219357880331605
-----------------------------
truncatedSVD10
Val AUC 0.714849059164995
Val AP 0.20162762249021143
-----------------------------
with L1
Val AUC 0.7204560173481427
Val AP 0.20606592195047516
-----------------------------
ngram_mean
Val AUC 0.7512333615016391
Val AP 0.2316057835286238
-----------------------------
ngram_30
Val AUC 0.7503096814704876
Val AP 0.23059991152936476
-----------------------------
scaling + svd100
Val AUC 0.6917416193343989
Val AP 0.19332778082469648
-----------------------------
last 4 + scaling
Val AUC 0.6903680151775861
Val AP 0.176578647251165
More MNB:
no mods
VAL AUC 0.7498575657215163
VAL AP 0.2293512205357953
-----------------------------
last4
VAL AUC 0.6581026055173183
VAL AP 0.1437128131114489
-----------------------------
all but last 4
VAL AUC 0.7070023469057676
VAL AP 0.20591464071739765
-----------------------------
scaling
VAL AUC 0.6536068300836146
VAL AP 0.159505790364585
-----------------------------
truncatedSVD100
VAL AUC 0.7096800773626037
VAL AP 0.18745670852850765
-----------------------------
truncatedSVD10
VAL AUC 0.6878866303842578
VAL AP 0.15705489949022805
-----------------------------
ngram_mean
VAL AUC 0.7157689633180022
VAL AP 0.18875477473370467
-----------------------------
ngram_30
VAL AUC 0.6931802543914015
VAL AP 0.16490223532898143
-----------------------------
scaling + svd100
VAL AUC 0.67029364609327
VAL AP 0.1807745196643682
-----------------------------
last 4 + scaling
VAL AUC 0.6793984223370249
VAL AP 0.16104718364310758
Clearly the best are just the standard MNB/SVM. Was kind of hoping for some better results.
I think we're done here :)
UPDATED 5/15 (bold = to include in table)
Multinomial Naive Bayes (downsampled, predict_proba):
Bernoulli Naive Bayes (downsampled, predict_proba): NOTE: all vectorizers performed exactly the same with Bernoulli NB. This is because Bernoulli NB only takes binary input, so it binarizes all vectors before modeling; since count, tfidf, and binary vectors share the same nonzero pattern, they all binarize to the same matrix. Apparently it is standard to use a CountVectorizer with binary=True for Bernoulli NB, so I would report only that one in the final paper.
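A quick check of that explanation: BernoulliNB thresholds its input at `binarize` (default 0.0), so all three vectorizers yield identical predictions on toy data:

```python
# Count, tfidf, and binary vectors have the same nonzero pattern, so
# BernoulliNB's binarization collapses them to the same 0/1 matrix.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

texts = ["free money", "staff meeting", "money money free", "meeting notes"]
y = [1, 0, 1, 0]
probe = ["free meeting money"]

probs = {}
for name, vec in [("count", CountVectorizer()),
                  ("tfidf", TfidfVectorizer()),
                  ("binary", CountVectorizer(binary=True))]:
    X = vec.fit_transform(texts)
    nb = BernoulliNB().fit(X, y)  # binarize=0.0: any positive value -> 1
    probs[name] = nb.predict_proba(vec.transform(probe))[0, 1]

print(probs)  # all three identical
```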
Complement Naive Bayes (entire dataset, predict_proba): (This was done since it is "particularly suited for imbalanced data sets".)
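A sketch of that run, assuming scikit-learn's ComplementNB (the quoted phrase is from its documentation), with a deliberately imbalanced toy corpus:

```python
# ComplementNB estimates each class's weights from the complement of
# that class, which mitigates the bias toward the majority class.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

# One positive against three negatives.
texts = ["urgent wire transfer", "lunch plans today", "budget review notes", "travel itinerary attached"]
y = [1, 0, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
cnb = ComplementNB().fit(X, y)

prob = cnb.predict_proba(vec.transform(["urgent transfer"]))[:, 1]
print(prob)
```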
I did some Naive Bayes upsampling vs downsampling investigations that I'll report separately from this.