arendakessian / spring2020-ml-project

fake review detection system

.ipynb for baseline model #5

Closed · guidopetri closed 4 years ago

guidopetri commented 4 years ago

Once all the feature engineering is done, we're ready to get cracking with model-building. However, we should have a baseline to compare performance to. He He suggested in class that this should be a Naïve Bayes classifier, though it's often also done with a simple, out-of-the-box random forest. We could also come up with some simple heuristics ourselves, e.g. "if a user has more than 4 reviews, it's a real review".
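For concreteness, the heuristic option could look roughly like this (purely a sketch; `user_review_count` is a made-up column name that would come out of our feature engineering):

```python
import pandas as pd

def rule_of_thumb(reviews: pd.DataFrame, min_reviews: int = 4) -> pd.Series:
    """If a user has more than `min_reviews` reviews, call it real (0); otherwise flag it as fake (1)."""
    # `user_review_count` is a hypothetical feature name.
    return (reviews["user_review_count"] <= min_reviews).astype(int)
```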

Any opinions? @kelseymarkey @arendakessian

kelseymarkey commented 4 years ago

I think all three of those seem like reasonable baselines. Is it crazy to implement all three (since I don't think it will be a ton of work)?

guidopetri commented 4 years ago

It wouldn't be a ton of work, but I'm not sure we're "allowed" to have 3 baselines. Maybe we just take the best one as the baseline?

kelseymarkey commented 4 years ago

Sounds good to me. We can add a line in the report about how we considered multiple possible baselines and ultimately decided one of them was most appropriate.

guidopetri commented 4 years ago

Heuristic baseline is pretty much implemented. Initial results:

```
Simple heuristic baseline
Train AUC: 0.6218
Dev   AUC: 0.6235
Train AP:  0.1321
Dev   AP:  0.1310
```
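For reference (not the actual notebook code), the AUC/AP lines above could be produced with scikit-learn's `roc_auc_score` and `average_precision_score`, e.g. with a helper like this, which the later sketches in this thread reuse:

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def report(name, y_train, scores_train, y_dev, scores_dev):
    """Print ROC AUC and average precision (AP) for the train and dev splits."""
    print(name)
    print(f"Train AUC: {roc_auc_score(y_train, scores_train):.4f}")
    print(f"Dev   AUC: {roc_auc_score(y_dev, scores_dev):.4f}")
    print(f"Train AP:  {average_precision_score(y_train, scores_train):.4f}")
    print(f"Dev   AP:  {average_precision_score(y_dev, scores_dev):.4f}")
```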
guidopetri commented 4 years ago

Naïve Bayes:

```
NB baseline
Train AUC: 0.6106
Dev   AUC: 0.5015
Train AP:  0.3008
Dev   AP:  0.1040
```
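A sketch of what the NB run might look like (assuming `MultinomialNB` over non-negative, count-like features; the notebook may well use a different NB variant, and `X_train`/`y_train`/`X_dev`/`y_dev` are placeholder names):

```python
from sklearn.naive_bayes import MultinomialNB

# X_train/X_dev and y_train/y_dev are placeholders for the engineered features and labels.
nb = MultinomialNB()
nb.fit(X_train, y_train)
report("NB baseline",
       y_train, nb.predict_proba(X_train)[:, 1],
       y_dev,   nb.predict_proba(X_dev)[:, 1])
```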
guidopetri commented 4 years ago

Some crappy RF experiments:

`n_estimators=12, max_features='log2'`

```
RF baseline
Train AUC: 0.9312
Dev   AUC: 0.5108
Train AP:  0.8739
Dev   AP:  0.1093
```

`n_estimators=12, max_features='log2', max_depth=1000`

```
RF baseline
Train AUC: 0.5006
Dev   AUC: 0.5000
Train AP:  0.1041
Dev   AP:  0.1016
```

`n_estimators=12, max_features='auto', max_depth=100`

```
RF baseline
Train AUC: 0.5009
Dev   AUC: 0.5003
Train AP:  0.1045
Dev   AP:  0.1021
```

I chose n_estimators=12 because I have 12 available cores to run this on.

It seems like RF is not the way to go. It's heavily overfitting on train when not specifying a max depth (and that took ~30min to run). When specifying a max depth, we get really crappy results.
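For reference, one of these runs would look roughly like the sketch below (same placeholder names as before; note that `n_jobs` is what actually maps onto the 12 cores, separately from `n_estimators`, which sets the number of trees):

```python
from sklearn.ensemble import RandomForestClassifier

# First configuration above: 12 trees, log2 feature subsampling, no depth limit.
rf = RandomForestClassifier(n_estimators=12, max_features='log2', n_jobs=12)
rf.fit(X_train, y_train)
report("RF baseline",
       y_train, rf.predict_proba(X_train)[:, 1],
       y_dev,   rf.predict_proba(X_dev)[:, 1])
```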

guidopetri commented 4 years ago

Another one:

`n_estimators=12, min_samples_leaf=0.001, max_features='auto'`

```
RF baseline
Train AUC: 0.5000
Dev   AUC: 0.5000
Train AP:  0.1029
Dev   AP:  0.1016
```
guidopetri commented 4 years ago

I'll keep running a few RFs looking for something in between the overfit first example and the extremely underfit later ones, but... it's looking like the heuristic baseline is the best one, with an AUC of .62 and an AP of .13. I'd like something closer to the .30 AP that NB gets on the train set, but that isn't reflected on dev...

Thoughts?

guidopetri commented 4 years ago

Alright... seems like whatever I try, I can't improve dev AUC/AP with RF.

I'd say our baseline should be the heuristic one.

kelseymarkey commented 4 years ago

@charlesoblack Nice work, and agreed that seems totally reasonable. Could you explain a bit more what the "heuristic model" is (either here or in the Google doc)? Is it 2 or fewer reviews = fake?

If we want to delve into this deeper (might be a good idea for the report?), I'd be curious to see performance when the heuristic model is 1 review = fake. I'd also be interested in some additional confusion matrix metrics, specifically FPR, since in our use case that's what we want to minimize (saying a user's review is fake when it's not is likely to anger them, and potentially prevent them from using the system).

guidopetri commented 4 years ago

Sure. Basically I just did some simple EDA and came up with this graph:

It shows the average label value (i.e. what % of reviews are fake) for each value of the "review count" feature. This means, for instance, that people's first reviews (marked as 0, since the feature is 0-indexed) were ~16% fake. That rate quickly drops off and stabilizes around 2%. I then just estimated a good value to cut that off at, said "the first x reviews that someone has are fake", and built a small classifier by hand out of that.
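In code, that EDA boils down to a groupby like this (sketch; `review_index`, `label`, and `train_df` are placeholder names):

```python
# Fraction of fake reviews at each 0-indexed review count: the curve in the graph.
fake_rate_by_index = train_df.groupby("review_index")["label"].mean()
print(fake_rate_by_index.head(10))
```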

Here's some more versions of it, with different cutoffs:

Heuristics classifier with different cutoffs

```
Simple heuristic baseline for 0
Train AUC: 0.6787
Dev   AUC: 0.6797
Train AP:  0.1558
Dev   AP:  0.1543

Simple heuristic baseline for 1
Train AUC: 0.6461
Dev   AUC: 0.6491
Train AP:  0.1406
Dev   AP:  0.1398

Simple heuristic baseline for 2
Train AUC: 0.6218
Dev   AUC: 0.6235
Train AP:  0.1321
Dev   AP:  0.1310

Simple heuristic baseline for 3
Train AUC: 0.6027
Dev   AUC: 0.6047
Train AP:  0.1264
Dev   AP:  0.1253

Simple heuristic baseline for 4
Train AUC: 0.5882
Dev   AUC: 0.5892
Train AP:  0.1224
Dev   AP:  0.1211

Simple heuristic baseline for 5
Train AUC: 0.5773
Dev   AUC: 0.5769
Train AP:  0.1196
Dev   AP:  0.1179
```
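Roughly, these per-cutoff numbers could be produced with a small sweep like the one below (a sketch, reusing the `report` helper and the placeholder `review_index` column from the earlier sketches):

```python
def heuristic_predict(df, cutoff):
    # Flag a review as fake (1) when it's among the user's first `cutoff + 1` reviews.
    return (df["review_index"] <= cutoff).astype(int)

for cutoff in range(6):
    report(f"Simple heuristic baseline for {cutoff}",
           y_train, heuristic_predict(train_df, cutoff),
           y_dev,   heuristic_predict(dev_df, cutoff))
```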

I suppose the best cutoff would have been at 0, i.e. only someone's very first review gets flagged as fake and everything else is treated as real. However, this was done without the downsampling, so sticking as close as possible to the majority class obviously gives the best result. I think I'll redo this with the downsampled data afterwards, if we have enough time.

Additionally (as mentioned over voice), just to have it written down: precision is true positives divided by all predicted positives (true + false positives), so pushing precision up means cutting down on false positives, which is exactly what keeps the false positive rate low (as you correctly mentioned). Thanks again for pointing that out; I don't think I would have realized it, haha.
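For the report, the confusion-matrix numbers Kelsey asked about could be pulled out like this (a sketch, assuming binary labels with fake = 1):

```python
from sklearn.metrics import confusion_matrix

def confusion_metrics(y_true, y_pred):
    """Precision and false positive rate for binary labels with fake = 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)  # of everything flagged as fake, how much really is fake
    fpr = fp / (fp + tn)        # share of real reviews wrongly flagged as fake
    return precision, fpr
```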

guidopetri commented 4 years ago

Now that we have downsampled data, should we run the baseline models on it? I'm guessing yes.

I'm also wondering if a baseline like "assume everything is real" works, too. We'd have 90% accuracy, after all (no clue about AUC/AP though). I'll run that as well and post the results here.
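Something like scikit-learn's `DummyClassifier` would cover the "everything is real" case (sketch, same placeholder names as before):

```python
from sklearn.dummy import DummyClassifier

# Always predicts class 0 ("real"): ~90% accuracy on our data, but the scores are
# constant, so ROC AUC comes out at 0.5 and AP at the fraction of fake reviews.
all_real = DummyClassifier(strategy='constant', constant=0)
all_real.fit(X_train, y_train)
report("All-real baseline",
       y_train, all_real.predict_proba(X_train)[:, 1],
       y_dev,   all_real.predict_proba(X_dev)[:, 1])
```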

This being said, we should probably think about whether it makes sense to baseline on the downsampled set (I think it does) and what to put in the report in the end. @arendakessian @kelseymarkey thoughts?

guidopetri commented 4 years ago

Baseline redone. The heuristics classifier still seems to be the best. I even tried making it a little more complicated (filtering by review count and by rating wherever the average label is >= 0.5) and it still only performed as well as calling all the 0th reviews fake.

Good news is: the NB and RF perform a lot better now. So maybe there's still hope for our models, haha.