PhilippPro / tuneRanger

Automatic tuning of random forests

Should I be using a train/test split with tuneRanger? #8

Closed taylorreiter closed 4 years ago

taylorreiter commented 4 years ago

Hi @PhilippPro! Thank you for tuneRanger, I've been having fun trying it out, and it has dramatically simplified my tuning pipeline. I'm curious how the samples are handled under the hood: do I need to split my data set 70:30 into training and testing, and then run tuneRanger on the 70%? I have been using a 70:30 split for training and testing, and then I have an independent validation dataset. I don't see a train/test split used in your documentation, so I'm curious what the recommended best practice is.

jakob-r commented 4 years ago

Allow me to answer this question:

From the README

Out-of-bag predictions are used for evaluation, which makes it much faster than other packages and tuning strategies that use for example 5-fold cross-validation

In the code: https://github.com/PhilippPro/tuneRanger/blob/abe82774ce449f6acc5623e9e9e5d867c3efd910/R/tuneRanger.R#L99-L100

In other words: tuneRanger does not need a separate test set because each tree is trained only on a bootstrap sample of the data (the "bag"), so the remaining out-of-bag observations can be used to obtain an unbiased performance estimate for each tree, and therefore for the forest as a whole.
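The out-of-bag idea described above is not specific to tuneRanger. As an illustration (a sketch only; tuneRanger is an R package, and this uses scikit-learn's random forest in Python, not tuneRanger's code), the same mechanism is exposed via scikit-learn's `oob_score` option:

```python
# Sketch of out-of-bag (OOB) evaluation, assuming scikit-learn's
# RandomForestClassifier as a stand-in for tuneRanger/ranger.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; no train/test split is performed.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each tree is fit on a bootstrap sample ("bag") of the rows. Rows left out
# of a tree's bag are that tree's out-of-bag samples. With oob_score=True,
# every row is predicted only by trees that never saw it during training,
# yielding an honest accuracy estimate without a held-out test set.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```

Because the OOB estimate comes essentially for free with the fitted forest, a tuner can score each hyperparameter candidate without refitting under k-fold cross-validation, which is the speed advantage the README refers to.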

Does this answer your question?

taylorreiter commented 4 years ago

Yes, thank you so much for taking the time to answer this question, and to give the details so clearly! I really appreciate it.

jakob-r commented 4 years ago

You're welcome.