Closed guidopetri closed 4 years ago
I think all three of those seem like reasonable baselines. Is it crazy to implement all three (since I don't think it will be a ton of work)?
It wouldn't be a ton of work, but I'm not sure we're "allowed" to have 3 baselines. Maybe we just take the best one as the baseline?
Sounds good to me. We can add a line in the report about how we considered multiple possible baselines and ultimately decided one of them was most appropriate.
Heuristic baseline is pretty much implemented. Initial results:
Simple heuristic baseline
Train AUC: 0.6218
Dev AUC: 0.6235
Train AP: 0.1321
Dev AP: 0.1310
Naïve Bayes:
NB baseline
Train AUC: 0.6106
Dev AUC: 0.5015
Train AP: 0.3008
Dev AP: 0.1040
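For reference, here's a minimal sketch of how these AUC/AP numbers could be computed with scikit-learn (an assumption; the data below is synthetic stand-in data, not our reviews):

```python
# Sketch: NB baseline + the AUC/AP metrics reported above.
# scikit-learn assumed; X/y here are synthetic placeholders.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = rng.integers(0, 2, size=1000)  # 1 = fake review
X_dev = rng.normal(size=(200, 5))
y_dev = rng.integers(0, 2, size=200)

nb = GaussianNB().fit(X_train, y_train)
for name, X, y in [("Train", X_train, y_train), ("Dev", X_dev, y_dev)]:
    scores = nb.predict_proba(X)[:, 1]  # probability of the "fake" class
    print(f"{name} AUC: {roc_auc_score(y, scores):.4f}")
    print(f"{name} AP: {average_precision_score(y, scores):.4f}")
```

Both metrics take the continuous score (not the hard 0/1 prediction), so `predict_proba` is used rather than `predict`.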
Some crappy RF experiments:
n_estimators=12, max_features='log2'
RF baseline
Train AUC: 0.9312
Dev AUC: 0.5108
Train AP: 0.8739
Dev AP: 0.1093
n_estimators=12, max_features='log2', max_depth=1000
RF baseline
Train AUC: 0.5006
Dev AUC: 0.5000
Train AP: 0.1041
Dev AP: 0.1016
n_estimators=12, max_features='auto', max_depth=100
RF baseline
Train AUC: 0.5009
Dev AUC: 0.5003
Train AP: 0.1045
Dev AP: 0.1021
I chose n_estimators=12 because I have 12 available cores to run this on.
It seems like RF is not the way to go. It's heavily overfitting on train when not specifying a max depth (and that took ~30min to run). When specifying a max depth, we get really crappy results.
Another one:
n_estimators=12, min_samples_leaf=0.001, max_features='auto'
RF baseline
Train AUC: 0.5000
Dev AUC: 0.5000
Train AP: 0.1029
Dev AP: 0.1016
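For the record, a rough sketch of the kind of RF sweep described above (scikit-learn assumed; the data is synthetic, so the numbers won't match ours; note `max_features="sqrt"` stands in for the old `"auto"` default, which newer scikit-learn versions removed for classifiers):

```python
# Sketch: sweeping the RF hyperparameters tried above on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 10))
y_train = (rng.random(2000) < 0.1).astype(int)  # ~10% positive, like our labels
X_dev = rng.normal(size=(500, 10))
y_dev = (rng.random(500) < 0.1).astype(int)

configs = [
    dict(max_features="log2"),                       # no depth cap: overfits train
    dict(max_features="log2", max_depth=1000),
    dict(max_features="sqrt", max_depth=100),        # "sqrt" == old "auto" default
    dict(max_features="sqrt", min_samples_leaf=0.001),
]
for cfg in configs:
    # n_estimators=12 / n_jobs=12 to match the 12 available cores
    rf = RandomForestClassifier(n_estimators=12, n_jobs=12, random_state=0, **cfg)
    rf.fit(X_train, y_train)
    dev_auc = roc_auc_score(y_dev, rf.predict_proba(X_dev)[:, 1])
    print(cfg, f"Dev AUC: {dev_auc:.4f}")
```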
I'll keep running a couple of RFs looking for something in between the overfit first example and the extremely underfit other examples, but this is looking like the heuristic one is the best, with an AUC of 0.62 and AP of 0.13. I'd like something more like the 0.30 AP that NB gets on the train set, but that isn't reflected on dev...
Thoughts?
Alright... seems like whatever I try, I can't improve dev AUC/AP with RF.
I'd say our baseline should be the heuristic one.
@charlesoblack Nice work, and agreed, that seems totally reasonable. Could you explain a bit more what the "heuristic model" is (either here or in the google doc)? Is it 2 or fewer reviews = fake?
If we want to delve into this deeper (might be a good idea for the report?), I'd be curious to see performance when the heuristic model is 1 review = fake. I'd also be interested to see some additional confusion matrix metrics, specifically FPR, since in our use case that is what we want to minimize (saying a user's review is fake when it's not is likely to anger them, and potentially prevent them from using the system).
Sure. Basically I just did some simple EDA and came up with this graph:
It shows the average label value (i.e. what % of reviews are fake) for a given "review count" feature. This means, for instance, that people's first reviews (marked as 0, since the feature is 0-indexed) were ~16% fake. This quickly drops off and stabilizes at 2%. I then estimated what might be a good cutoff value, said "the first x reviews that someone has are fake", and built a small classifier by hand out of that.
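The hand-built classifier can be sketched like this (pandas/numpy assumed; the column names `review_count` and `label` and the toy data are illustrative, not our actual schema):

```python
# Sketch of the hand-built heuristic: flag a user's first `cutoff`
# reviews (0-indexed review_count < cutoff) as fake.
import numpy as np
import pandas as pd

def heuristic_predict(review_count, cutoff=2):
    """Score 1 (fake) for the first `cutoff` reviews a user writes, else 0."""
    return (np.asarray(review_count) < cutoff).astype(int)

# Toy data: 0-indexed review count per review, plus the true label.
df = pd.DataFrame({
    "review_count": [0, 0, 1, 2, 3, 5, 0, 7],
    "label":        [1, 0, 0, 0, 0, 0, 1, 0],
})

# The EDA step behind the graph: average label (% fake) per review count.
print(df.groupby("review_count")["label"].mean())

# cutoff=1 flags only 0-indexed first reviews as fake.
print(heuristic_predict(df["review_count"], cutoff=1))
```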
Here are some more versions of it, with different cutoffs:
I suppose the best cutoff would have been at 0, i.e., all the reviews are real. However, this was done without the downsampling, so obviously always choosing the majority class gives the best result. I think I'll redo this with the downsampled data afterwards, if we have enough time.
Additionally (as mentioned over voice), just to have it written down: precision is true positives divided by true plus false positives, so maximizing precision keeps false positives down, which is closely tied to keeping the false positive rate low (as you mentioned, FPR itself is FP / (FP + TN)). Thanks again for pointing that out - I don't think I would have realized it, haha.
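To make the distinction concrete, here's a small sketch of both metrics pulled out of a confusion matrix (scikit-learn assumed; the labels are toy data). Precision and FPR share the FP term but have different denominators, so they move together without being identical:

```python
# Sketch: precision vs. false positive rate from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # TP / (TP + FP)
fpr = fp / (fp + tn)        # FP / (FP + TN)
print(f"precision={precision:.2f}  FPR={fpr:.2f}")
```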
Now that we have downsampled data, should we run the baseline models on it? I'm guessing yes.
I'm also wondering if a baseline as "assume everything is real" works, too. We'd have 90% accuracy, after all (no clue about AUC/AP though). I'll run that as well and post the results here.
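For what it's worth, the "everything is real" baseline is fully predictable without running it: a constant score means every ranking is a tie, so AUC comes out at 0.5 and AP equals the fake-class prevalence. A quick sketch (scikit-learn assumed, synthetic labels):

```python
# Sketch: "assume everything is real" baseline via a constant score.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.1).astype(int)  # ~10% fake, like the full data
scores = np.zeros_like(y, dtype=float)    # constant "real" score for everyone

print(f"AUC: {roc_auc_score(y, scores):.4f}")              # ties everywhere -> 0.5
print(f"AP: {average_precision_score(y, scores):.4f}")     # == prevalence of fakes
```

So it would indeed get ~90% accuracy, but it's uninformative under AUC/AP, which is a good argument for reporting those metrics.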
This being said, we should probably think about whether it makes sense to baseline on the downsampled set (I think it does) and what to put on the report in the end. @arendakessian @kelseymarkey thoughts?
Baseline redone. The heuristic classifier still seems to be the best. I even tried making it a little more complicated (filtering by review count and rating for avg label >= 0.5) and it still only performed as well as calling all the 0th reviews fake.
Good news is: the NB and RF perform a lot better now. So maybe there's still hope for our models, haha.
Once all the feature engineering is done, we're ready to get cracking with model-building. However, we should have a baseline to compare performance to. He He suggested in class that this should be a Naïve Bayes classifier; though often it's also done as a simple, out-of-the-box random forest. We could also come up with some simple heuristics ourselves, e.g. "if a user has more than 4 reviews, it's a real review".
Any opinions? @kelseymarkey @arendakessian