gmcdonald-sfg closed this issue 2 years ago
I think the link to the Breiman reference addresses this, and the caveat addresses point #1. I think it's instructive to see some models of each type, so I'll add some language to the caveat. I also changed "you don't really need" to "you don't always need".
Even though random forests have a built-in way of assessing performance via the OOB error, and it may be technically defensible to skip a separate training/testing split in some applications, I still think it is always good practice to use separate training/testing datasets in ML. Here’s a good explanation: https://www.dataminingapps.com/2018/02/is-it-really-necessary-to-split-a-data-set-into-training-and-validation-when-building-a-random-forest-model-since-each-tree-built-uses-a-random-sample-with-replacem/ .

Beyond the 3 reasons listed there, another good reason to use separate training/testing datasets is that a held-out set lets you calculate any performance metric you want, not just the OOB error rate or OOB MSE that random forest provides internally. I would suggest that all of the random forest examples use separate training/testing datasets, just like you’ve done in the previous examples.
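To make the point concrete, here is a minimal sketch (using scikit-learn, not code from this repo — the dataset and parameter choices are illustrative assumptions) showing that the OOB score only gives you one built-in accuracy estimate, while a held-out test set lets you compute any metric you like:

```python
# Illustrative sketch: OOB accuracy vs. metrics on a held-out test set.
# Dataset is synthetic; n_estimators/test_size are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set up front, as in the earlier examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

# OOB provides a single built-in accuracy estimate...
print(f"OOB accuracy:  {rf.oob_score_:.3f}")

# ...whereas a held-out set supports any metric (F1, AUC, etc.).
y_pred = rf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Test F1:       {f1_score(y_test, y_pred):.3f}")
```

The two accuracy numbers usually land close together, which is the sense in which OOB is "technically defensible" — but only the held-out set gives you the flexibility to report other metrics.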