
Random Forest for Classification #7

Open ConorWaldron opened 3 years ago

ConorWaldron commented 3 years ago

On the training set, the model does very well, with near-perfect ROC and PR curves, an F1 score of 0.97, and an accuracy of 0.98. This is expected for a random forest model; the only reason we don't get perfect accuracy is probably that two people have identical feature vectors but different outcomes, so the model can't split them.

(images: ROC and PR curves on the training set)

On the test set the model still does quite well: F1 = 0.75 and accuracy = 0.81. (images: ROC and PR curves on the test set)

The drop in performance between the train and test sets is not evidence of overfitting; it is just how random forests work: they get near-perfect performance on the training set. They can't actually overfit in the usual way, because rather than extending the decision tree boundaries in weird ways to reach outliers/noise, they wrap them in near-infinitely small volumes. See the README for more info.
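A minimal sketch of how these train/test metrics could be computed with scikit-learn; `X_train`, `X_test`, `y_train` and `y_test` are assumed names, not the actual notebook code:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X)               # hard labels for accuracy/F1
    proba = model.predict_proba(X)[:, 1]  # positive-class scores for the AUC metrics
    print(f"{name}: accuracy={accuracy_score(y, pred):.2f}, "
          f"F1={f1_score(y, pred):.2f}, "
          f"ROC AUC={roc_auc_score(y, proba):.2f}, "
          f"PR AUC={average_precision_score(y, proba):.2f}")
```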

ConorWaldron commented 3 years ago

Try changing the default hyperparameters (number of trees, etc.).

Using 1000 trees instead of the default value of 100 improved performance only very marginally:

- accuracy: 0.81 -> 0.82
- F1: 0.75 -> 0.76
- ROC AUC: 0.89 -> 0.90
- PR AUC: 0.86 -> 0.87
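A sketch of this change, assuming scikit-learn's `RandomForestClassifier`, where `n_estimators` is the number of trees (default 100):

```python
from sklearn.ensemble import RandomForestClassifier

# X_train/y_train as in the earlier sketch; only the tree count changes.
model = RandomForestClassifier(n_estimators=1000, random_state=42)
model.fit(X_train, y_train)
```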

ConorWaldron commented 3 years ago

Do random forests need data pre-processing? Check if mean-centering the data makes a difference.

The internet says random forests do not need feature scaling (mean centering or other normalisation methods): as decision-tree-based methods, their splits depend only on the ordering of feature values, so scaling won't affect the results at all.

You just need to one-hot encode categorical/binary data.
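A sketch of both points, assuming scikit-learn and pandas; `df` and the column names are illustrative, not the actual notebook variables:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# One-hot encode categorical/binary columns ("Sex" and "Embarked" are assumed
# Titanic-style column names, for illustration only).
X = pd.get_dummies(df, columns=["Sex", "Embarked"])

# Fit one forest on raw features and one on mean-centred features.
scaler = StandardScaler()
rf_raw = RandomForestClassifier(random_state=42).fit(X_train, y_train)
rf_scaled = RandomForestClassifier(random_state=42).fit(
    scaler.fit_transform(X_train), y_train)

# Tree splits depend only on the ordering of feature values, which an affine
# rescaling preserves, so with the same seed the predictions should match.
assert np.array_equal(rf_raw.predict(X_test),
                      rf_scaled.predict(scaler.transform(X_test)))
```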

ConorWaldron commented 3 years ago

Looking at the feature importances, we see the most important variables are your age, sex, and how much you paid for your ticket. The class of your ticket matters less than what you paid for it. (image: feature importance chart)
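A sketch of how a chart like this could be produced, assuming `model` and a pandas DataFrame `X_train` from the earlier sketches:

```python
import matplotlib.pyplot as plt
import pandas as pd

# feature_importances_ is scikit-learn's impurity-based importance: one value
# per column, normalised to sum to 1.
importances = pd.Series(model.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(title="Random forest feature importances")
plt.show()
```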

ConorWaldron commented 3 years ago

Using an AdaBoost model resulted in a significant improvement in the test-set performance:

- accuracy: 0.81 -> 0.92
- F1: 0.75 -> 0.88
- ROC AUC: 0.89 -> 0.93
- PR AUC: 0.86 -> 0.85

(images: AdaBoost performance plots)

Strangely, the AdaBoost performance on the training set is not perfect; in fact, it is worse than its performance on the test set.

The important features for the AdaBoost model change significantly from the random forest: for AdaBoost, fare and age are the most important predictors, not gender.
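A sketch of the AdaBoost fit and importance comparison, assuming scikit-learn defaults since the exact settings aren't shown here:

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

# X_train/y_train assumed from the earlier sketches.
ada = AdaBoostClassifier(random_state=42)
ada.fit(X_train, y_train)

# AdaBoost also exposes feature_importances_, so its ranking can be compared
# directly against the random forest's.
ada_importances = pd.Series(ada.feature_importances_, index=X_train.columns)
print(ada_importances.sort_values(ascending=False))
```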

ConorWaldron commented 3 years ago

How does feature importance actually work?

What direction of effect does each feature have on the model?
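One way to start exploring both questions (an assumed approach, not code from this repo): scikit-learn's permutation importance measures how much shuffling a feature hurts the score, and partial dependence plots show the direction of each feature's effect:

```python
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

# model/X_test/y_test assumed from the earlier sketches; "Age" and "Fare" are
# assumed column names.
# Permutation importance: the score drop when one feature's values are shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=42)

# Partial dependence: the direction/shape of a feature's effect on predictions.
PartialDependenceDisplay.from_estimator(model, X_test, features=["Age", "Fare"])
```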