carpentries-incubator / r-ml-tabular-data

A Data-Carpentry-style lesson on some ML techniques in R
https://carpentries-incubator.github.io/r-ml-tabular-data/

_episodes_rmd/04-Decision-Forests.Rmd: Major text and code edit #16

Closed · gmcdonald-sfg closed this issue 2 years ago

gmcdonald-sfg commented 2 years ago

Random forests do have a built-in way of assessing performance via the out-of-bag (OOB) error, and relying on it without a separate training/testing split may be technically defensible in some applications. Even so, I think it is always good practice to use separate training/testing datasets in ML. Here's a good explanation: https://www.dataminingapps.com/2018/02/is-it-really-necessary-to-split-a-data-set-into-training-and-validation-when-building-a-random-forest-model-since-each-tree-built-uses-a-random-sample-with-replacem/ . Beyond the three reasons listed there, another good reason to use a separate test set is that it lets you calculate any performance metric you want, not just the OOB error rate or OOB MSE that the random forest provides internally. I would suggest that all of the random forest examples use separate training/testing datasets, just as the previous examples do; a sketch of that workflow follows.
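
For concreteness, here is a minimal sketch of the suggested train/test workflow, assuming the `randomForest` package; `mtcars` and `mpg` are stand-ins for the episode's actual dataset and response, not taken from the lesson:

```r
library(randomForest)  # assumed; substitute whichever package the episode uses

set.seed(123)  # for reproducibility

# Hold out ~30% of the rows as a test set (mtcars is a placeholder dataset)
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit the forest on the training rows only
rf_fit <- randomForest(mpg ~ ., data = train)

# With a held-out test set you can compute any metric you like,
# not just the internal OOB estimates, e.g. test-set MSE:
preds <- predict(rf_fit, newdata = test)
mean((test$mpg - preds)^2)
```

Printing `rf_fit` still reports the OOB MSE, so the two estimates can be compared side by side.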

djhunter commented 2 years ago

I think the link to the Breiman reference addresses this, and the caveat addresses point #1. It's instructive to see some models of each type, so I'll add some language to the caveat. I also changed "you don't really need" to "you don't always need".