Closed: digitech-ai closed this issue 3 years ago
@digitech-ai Yes, you are correct. In fact, after you shortlist multiple models and fine-tune them with a search such as GridSearchCV or RandomizedSearchCV, by default these searches refit on the whole dataset you fed them after finding the best hyperparameters, and return the best estimator. For example, you can look at the refit parameter of GridSearchCV in the docs here.
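As a sketch of that refit behaviour (the model, parameter grid, and toy data below are placeholders, not from this thread):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for your training set.
X, y = make_classification(n_samples=200, random_state=0)

# refit=True (the default) retrains the best estimator on ALL of X, y
# after the cross-validated search finishes.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=3,
    refit=True,
)
search.fit(X, y)

# best_estimator_ is already fitted on the full dataset you passed in.
print(search.best_params_)
print(search.best_estimator_.score(X, y))
```

With refit=False the search would only report scores and best_params_, and you would have to fit the final model yourself.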
Thanks!
For a given full dataset, we split the data into train and test sets (let's say 70-30).
Now we perform EDA and model selection on the train data.
Let's say we go ahead with random forest; we would like to get a cross-validation score for it before trying it out on the test data.
From the above, cv=10 means a 10-fold strategy will be used: the model will be trained and evaluated 10 times. In each iteration, 9/10 of the train data (which is 70% of the full data) becomes the new training data for model fitting, and the remaining fold is used as the validation set. Likewise, this is repeated for all 10 iterations with different training and validation folds.
My point is to understand that our actual training size becomes 70% * (9/10) = 63% of the full data.
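That arithmetic can be checked directly with a placeholder 1000-sample dataset (the sizes here are illustrative, not from the thread):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)

# 70-30 split: 700 samples end up in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train))  # 700

# 10-fold CV on the training set: each fit sees 9/10 of it.
kf = KFold(n_splits=10)
train_idx, val_idx = next(kf.split(X_train))
print(len(train_idx))  # 630, i.e. 63% of the full 1000 samples
```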
Still, once we get the mean accuracy score, we can re-train the model on the entire training data, or even the full dataset, before making it live on production data.
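A minimal sketch of that workflow, scoring with cross_val_score and then refitting on the whole training set (the model choice and data are placeholder assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each of the 10 fits trains on 9/10 of X_train and validates on the rest.
scores = cross_val_score(model, X_train, y_train, cv=10)
print(scores.mean())

# Once satisfied with the mean score, refit on the entire training set
# (or even the full dataset) before going to production.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```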
Please let me know if my understanding is correct.