ConorWaldron opened 3 years ago
Try changing the default hyperparameters (number of trees, etc.).
Using 1000 trees instead of the default 100 improved performance only marginally: accuracy 0.81 -> 0.82, F1 0.75 -> 0.76, ROC AUC 0.89 -> 0.90, PR AUC 0.86 -> 0.87.
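A minimal sketch of that change, assuming train/test splits (`X_train`, `y_train`, `X_test`, `y_test`) already exist from the notebook; this is not the repo's actual training script:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# 1000 trees instead of the sklearn default of 100
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
print("accuracy", accuracy_score(y_test, preds), "F1", f1_score(y_test, preds))
```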
Do random forests need data preprocessing? Check if mean-centering the data makes a difference.
The internet says random forests do not need feature scaling (mean centering or other normalisation methods) because they are decision-tree-based, so it won't affect the results at all.
You just need to one-hot encode categorical/binary data.
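For example, with pandas (the column names here are the usual Titanic ones and are just an assumption about this dataset):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["S", "C", "S"],
                   "Age": [22, 38, 26]})
# drop_first avoids a redundant column for binary features like Sex
print(pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True))
```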
Looking at the feature importances, we see the most important variables are your age, sex, and how much you paid for your ticket. The class of your ticket matters less than what you paid for it.
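That ranking can be read off a fitted sklearn forest like this (assuming the fitted model `rf` and a DataFrame `X_train` from earlier):

```python
import pandas as pd

# Importances are the mean decrease in impurity, normalised to sum to 1
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```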
Using an AdaBoost model resulted in a significant improvement in test-set performance:
accuracy 0.81 -> 0.92, F1 0.75 -> 0.88, ROC AUC 0.89 -> 0.93, PR AUC 0.86 -> 0.85.
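A hedged sketch of the AdaBoost run, assuming the same splits as before; the notebook's actual settings may differ from these sklearn defaults:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)

ada = AdaBoostClassifier(random_state=42)  # sklearn default: 50 depth-1 stumps
ada.fit(X_train, y_train)
preds = ada.predict(X_test)
scores = ada.predict_proba(X_test)[:, 1]  # probability of survival
print("accuracy", accuracy_score(y_test, preds))
print("F1      ", f1_score(y_test, preds))
print("ROC AUC ", roc_auc_score(y_test, scores))
print("PR AUC  ", average_precision_score(y_test, scores))
```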
Strangely, the AdaBoost performance on the training set is not perfect; in fact, it is worse than its performance on the test set.
The important features for AdaBoost change significantly from the random forest: for AdaBoost, fare and age are the most important predictors, not gender.
How does feature importance actually work?
What direction of effect does each feature have on the model?
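For what it's worth, sklearn's default `feature_importances_` is the mean decrease in impurity across the trees, which is known to be biased towards high-cardinality features. Two standard ways to probe both questions, assuming the fitted model `rf` from before (the feature names `"Age"` and `"Fare"` are guesses at this repo's columns):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

# Permutation importance: how much does shuffling one column hurt the
# test score? A more reliable answer to "how important is this feature".
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for name, mean in sorted(zip(X_test.columns, result.importances_mean),
                         key=lambda t: -t[1]):
    print(f"{name}: {mean:.3f}")

# Partial dependence: the direction of each feature's effect on the
# predicted outcome (requires scikit-learn >= 1.0 for this API)
PartialDependenceDisplay.from_estimator(rf, X_test, ["Age", "Fare"])
plt.show()
```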
On the training set, the model does very well, with near-perfect ROC and PR curves, an F1 score of 0.97, and an accuracy of 0.98. This is expected for a random forest; the only reason we don't get perfect accuracy is probably two passengers who have identical feature vectors but different outcomes, so the model can't split them.
On the test set the model still does quite well: F1 = 0.75 and accuracy = 0.81.
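A sketch of how that train/test comparison can be computed, reusing the assumed `rf` and splits:

```python
from sklearn.metrics import accuracy_score, f1_score

for split, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    preds = rf.predict(X)
    print(split, "accuracy", round(accuracy_score(y, preds), 2),
          "F1", round(f1_score(y, preds), 2))
```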
The drop in performance between the train and test sets is not evidence of overfitting; it is just how random forests work, as they get near-perfect performance on the training set. They can't really overfit: rather than extending the decision boundaries in weird ways to reach outliers/noise, they wrap them in near-infinitely small volumes. See the README for more info.