MichaelAquilina / Reddit-Recommender-Bot

Identifying Interesting Documents for Reddit using Recommender Techniques

Compare Machine Learning Performance #29

Closed MichaelAquilina closed 10 years ago

MichaelAquilina commented 10 years ago

Current performance is being tested with a linear Support Vector Machine. It would be smart to evaluate the performance of several classifiers on the dataset:

- Naive Bayes (Multinomial and Bernoulli)
- Support Vector Machines (linear and RBF kernels)
- Random Forest
- K Nearest Neighbors

These are all supported within sklearn and use the same interface as SVC, so evaluating their performance should be no issue at all.

The results below are for 300 Python vs 300 Science documents (simple binary classification), using 1 round of train_test_split.
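A minimal sketch of how such a comparison might look, exploiting sklearn's shared estimator interface. The data here is a random placeholder standing in for the real 300 vs 300 term vectors (the shapes and count distribution are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Placeholder data standing in for the 300 Python vs 300 Science
# term-count vectors (dimensions and distribution are assumptions).
rng = np.random.RandomState(0)
X = np.vstack([rng.poisson(0.4, size=(300, 1000)),
               rng.poisson(0.6, size=(300, 1000))])
y = np.array([0] * 300 + [1] * 300)

# All classifiers share the same fit/predict interface as SVC.
classifiers = {
    'Naive Bayes (Multinomial)': MultinomialNB(),
    'Naive Bayes (Bernoulli)': BernoulliNB(),
    'SVM (Linear)': SVC(kernel='linear'),
    'SVM (RBF)': SVC(kernel='rbf'),
    'Random Forest': RandomForestClassifier(),
    'K Nearest Neighbors (K=5)': KNeighborsClassifier(n_neighbors=5),
}

# One round of train_test_split, as in the results below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))
    print('F1 Measure:', f1_score(y_test, y_pred))
```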

Initial tests show that the two best performing are Multinomial Naive Bayes and the linear SVM.

These correspond to results reported in other papers, which is encouraging.

MichaelAquilina commented 10 years ago

Naive Bayes (Multinomial)

```
[[76  0]
 [ 2 72]]
```

Accuracy: 0.986666666667
Precision: 1.0
Recall: 0.972972972973
F1 Measure: 0.986301369863

Naive Bayes (Bernoulli)

```
[[69  6]
 [ 1 74]]
```

Accuracy: 0.953333333333
Precision: 0.925
Recall: 0.986666666667
F1 Measure: 0.954838709677

MichaelAquilina commented 10 years ago

Support Vector Machine (Linear)

```
[[73  2]
 [ 1 74]]
```

Accuracy: 0.98
Precision: 0.973684210526
Recall: 0.986666666667
F1 Measure: 0.980132450331

Support Vector Machine (Radial Basis Function)

```
[[65  9]
 [ 0 76]]
```

Accuracy: 0.94
Precision: 0.894117647059
Recall: 1.0
F1 Measure: 0.944099378882

Note that the linear kernel seems to provide improved performance because we are already working in such a high-dimensional space.
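A toy illustration of the point above (shapes are assumptions, not the real data): with far more features than samples, points in general position are linearly separable, so a linear kernel can already fit the training data perfectly and the extra flexibility of an RBF kernel mostly adds overfitting risk.

```python
import numpy as np
from sklearn.svm import SVC

# 100 samples in 5000 dimensions - a stand-in for a bag-of-words
# representation where features vastly outnumber documents.
rng = np.random.RandomState(0)
X = rng.rand(100, 5000)
y = rng.randint(0, 2, size=100)  # even random labels are separable here

# Large C approximates a hard margin; the linear kernel separates
# the training set perfectly in this high-dimensional space.
linear = SVC(kernel='linear', C=1e6).fit(X, y)
print('training accuracy:', linear.score(X, y))
```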

MichaelAquilina commented 10 years ago

Random Forest Classifier

```
[[73  1]
 [ 5 71]]
```

Accuracy: 0.96
Precision: 0.986111111111
Recall: 0.934210526316
F1 Measure: 0.959459459459

MichaelAquilina commented 10 years ago

K Nearest Neighbors (K=5, metric='minkowski')

```
[[20 57]
 [ 0 73]]
```

Accuracy: 0.62
Precision: 0.561538461538
Recall: 1.0
F1 Measure: 0.71921182266

K Nearest Neighbors (K=5, metric='euclidean')

```
[[20 55]
 [ 0 75]]
```

Accuracy: 0.633333333333
Precision: 0.576923076923
Recall: 1.0
F1 Measure: 0.731707317073

Notice how it classifies one class perfectly (full recall) but frequently mislabels the other class as it. This could indicate good performance with some fine tuning, but support vector machines seem likely to be the better choice given how well they work out of the box.
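If KNN were worth tuning further, a grid search over its main parameters would be the obvious next step. A sketch with placeholder data (the parameter grid values are assumptions, not results from this project):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for the real term vectors.
rng = np.random.RandomState(0)
X = rng.poisson(0.4, size=(200, 500)).astype(float)
y = np.array([0] * 100 + [1] * 100)

# Hypothetical search space; brute force is required for the
# cosine metric.
grid = GridSearchCV(
    KNeighborsClassifier(algorithm='brute'),
    param_grid={
        'n_neighbors': [1, 3, 5, 11, 21],
        'metric': ['euclidean', 'manhattan', 'cosine'],
        'weights': ['uniform', 'distance'],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```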

MichaelAquilina commented 10 years ago

These techniques may prove redundant if the planned one-class classification approach works out.
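For reference, sklearn's OneClassSVM would be one way to try that approach: train on positive ("interesting") documents only and flag new documents as inliers or outliers. The data, nu, and gamma values here are placeholder assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder feature vectors: training uses only the positive
# class, which is the defining trait of one-class classification.
rng = np.random.RandomState(0)
X_pos = rng.normal(0, 1, size=(300, 50))             # "interesting" docs
X_new = np.vstack([rng.normal(0, 1, size=(10, 50)),  # similar docs
                   rng.normal(5, 1, size=(10, 50))]) # dissimilar docs

# nu bounds the fraction of training outliers; both values would
# need tuning on real data.
clf = OneClassSVM(nu=0.1, gamma='scale').fit(X_pos)
pred = clf.predict(X_new)  # +1 = inlier, -1 = outlier
print(pred)
```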