brainhack-school2020 / EliseD_BLUP_Brain-Learning-Unicorn-Project

Trying to deal with new stuff learned at the amazing Brainhack school

What is wrong with my dear random forest model?! #4

Elise-douard closed this issue 4 years ago

Elise-douard commented 4 years ago

Hey @illdopejake (and/or others?)!

I just finished running a random forest model for this project, with a script inspired by your dear presentation on the SVR model. (Thanks for this great content!)

The features are the volumes of 68 regions in > 35,000 brains, and the labels are 2 highly imbalanced groups...

I get super-suspicious r2 and MAE values with my final set: accuracy (r2) = 0.96, MAE = 0.035.

I also "tweeked" the depth parameter to see if there was something wrong on this side, but it seems that the value choose (10) was the best (after, there is an over-fitting, see figure bellow) image

Can these perfect r2 and MAE values be due to the large amount of data (n > 35,000 brains to feed this model) or to the imbalanced groups? Is there a specific step to take for random forests? Or can I publish in Nature now? :p

Thanks!

illdopejake commented 4 years ago

Hi Elise,

It was a pleasure going through your notebook. The question is interesting and the notebook was very well documented and easy to follow. I have a few notes related to your questions here:

1) Accuracy metrics: The first thing to point out is that the accuracy metrics you use are (in most cases) not related to your loss function. In other words, you are using MAE and (at times) r2 to evaluate your results, but those metrics are meant for evaluating regression problems. If I understand correctly, you are doing a classification problem, so you are better off looking at accuracy, precision, recall and the like. In this case, your loss function is probably accuracy, so it's important to evaluate the model along similar criteria. You can access these metrics in sklearn.metrics; view the documentation here. Some other functions I find useful for evaluating classification results are classification_report and confusion_matrix. For an example of how to use these metrics in a context you are familiar with, you can see an older iteration of our ML presentation here, where we did a classification problem instead of regression.
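
For concreteness, here is a minimal sketch of those metrics in use (assuming a fitted classifier `clf` and held-out `X_test` / `y_test`; the names are placeholders, not from your notebook):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report, confusion_matrix)

# `clf` stands in for your fitted random forest; X_test / y_test are
# your held-out features and labels.
y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))

# Per-class precision, recall and F1 in one table.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```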

2) Class imbalance: As you've learned, it is important to consider your accuracy metrics and visualize your results when you have serious class imbalance. Here, one class has 24,146 samples (96.4% of your data) and the other has 885 (3.6% of your data). Notice something familiar? Your accuracy was exactly 96.4%! Your classifier is trained on accuracy, and it learned that if it just calls everyone class 1, it will reach a very high level of accuracy. But that is not helpful to you, is it?!
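
You can see this trap directly with a majority-class baseline. A quick sketch (DummyClassifier is in sklearn; `X_train` / `y_train` / `X_test` / `y_test` are placeholder names for your splits):

```python
from sklearn.dummy import DummyClassifier

# A "classifier" that always predicts the majority class. With a
# 24,146 vs 885 split, it reaches ~96.4% accuracy while learning
# nothing at all from the brain volumes.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("majority-class accuracy:", baseline.score(X_test, y_test))
```

If your forest's precision and recall don't beat that baseline, it has effectively learned the class proportions and nothing else.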

So how do you get around this? There are a few options. Some estimators allow you to choose the loss function of your model with the score or scoring argument, so in those cases it may be more useful to train your model on precision or recall instead. Other estimators, such as the RandomForestClassifier you are using, have built-in ways of dealing with class imbalance. Check out the class_weight argument in the RandomForestClassifier documentation. These are probably the best approaches. There are also some newer resampling strategies in sklearn that you could experiment with, though in my limited experience with them, they haven't been terribly effective.
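
As a rough sketch combining both ideas (class_weight plus a recall-oriented cross-validation score; max_depth=10 taken from your tuning, and `X` / `y` as placeholder names for your features and labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# class_weight='balanced' reweights samples inversely to class
# frequency, so the 885-subject class carries as much weight during
# training as the 24,146-subject one.
clf = RandomForestClassifier(max_depth=10, class_weight="balanced",
                             random_state=0)

# Score on recall instead of raw accuracy; `scoring` also accepts
# "precision", "f1", or "balanced_accuracy".
scores = cross_val_score(clf, X, y, cv=5, scoring="recall")
print("mean CV recall:", scores.mean())
```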

Why don't you give some of those options a try and we'll see if that improves things. At the very least, you will want to report your precision and recall metrics alongside accuracy, and to visualize a confusion matrix as well.
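
For the confusion matrix visualization, a short sketch like this should do (ConfusionMatrixDisplay.from_predictions requires scikit-learn >= 1.0; `y_test` / `y_pred` are the same placeholder names as above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Draws the confusion matrix as a heatmap directly from true vs.
# predicted labels (y_test / y_pred from the earlier sketch).
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
```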

Elise-douard commented 4 years ago

(oops I removed the comment)

Thank you so much for this detailed response and for sharing your scripts. I was totally mind-blown by the fact that accuracy = % of controls; it seems so evident now, but I had NO CLUE! https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif

You saved this project! I will try my best to apply your suggestions, and hopefully it will end well.