ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0
25.18k stars 12.92k forks

Understanding on train, cv, test #593

Closed digitech-ai closed 3 years ago

digitech-ai commented 4 years ago

For a given full dataset, we split the data into train and test sets (let's say 70-30).

Now we perform EDA and model selection on the train data.

Let's say we go ahead with a random forest; we would like to compute a cross-validation score on it before trying it out on the test data.

import numpy as np
from sklearn.model_selection import cross_val_score

# tree_reg is the estimator; housing_prepared and housing_labels are
# the prepared training features and labels from the book's example
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

From the above, cv=10 means a 10-fold strategy will be used: the model will be trained and evaluated 10 times. In each iteration, 9/10 of the train data (which is 70% of the full data) becomes the new training data, and the remaining fold is used as the cross-validation set for evaluation. Likewise, this is repeated 10 times, each time with a different training/validation split.

My point is to understand that the actual training size in each fold becomes 70% × (9/10) = 63% of the full dataset.
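This fold arithmetic can be checked directly with scikit-learn's KFold. Below is a minimal sketch using hypothetical numbers (1000 samples total, a 70/30 train-test split) rather than the book's housing data:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical sizes: 1000 samples total, 70% kept for training
n_full = 1000
n_train = int(n_full * 0.7)            # 700 training samples
X_train = np.arange(n_train).reshape(-1, 1)

kf = KFold(n_splits=10)
for train_idx, val_idx in kf.split(X_train):
    # Each fold trains on 9/10 of the 700 samples:
    # 630 samples = 63% of the full 1000-sample dataset
    assert len(train_idx) == 630
    assert len(val_idx) == 70
```

So with a 70-30 split and 10-fold CV, each individual fit indeed sees 63% of the full data.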

Though once we get the mean score, we can re-train the model on the entire training data, or even the full dataset, before putting it live on production data.
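That final step can be sketched as follows. This is a minimal example with synthetic stand-ins for housing_prepared and housing_labels (the names and data here are illustrative, not from the book):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the prepared training features and labels
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = rng.rand(200)

model = RandomForestRegressor(n_estimators=10, random_state=42)

# Each CV fit uses only 9/10 of X; the scores estimate generalization error
scores = cross_val_score(model, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

# After CV, refit once on the entire training set (100% of X, not 9/10)
model.fit(X, y)
```

The cross-validation scores are only used to estimate performance; the model you deploy is the one refit on all available training data.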

Please help me if my understanding is correct.

Praful932 commented 4 years ago

@digitech-ai Yes, you are correct. In fact, after you shortlist multiple models and fine-tune them using a search such as GridSearchCV or RandomizedSearchCV, by default these searches refit on the whole dataset you fed them after finding the best hyperparameters, and return the best estimator. For example, you can look at the refit parameter of GridSearchCV in the docs here.
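A minimal sketch of that refit behavior, using synthetic data and an illustrative parameter grid (the grid values here are arbitrary, not from the book):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared training set
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.rand(100)

param_grid = {"n_estimators": [5, 10], "max_depth": [2, 4]}
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    refit=True,  # default: refit the best params on ALL data passed to fit()
)
grid.fit(X, y)

# best_estimator_ is already trained on all 100 samples, not just 2/3 of them
best_model = grid.best_estimator_
```

So there is no need to manually retrain after the search: grid.best_estimator_ is the refit model.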

digitech-ai commented 3 years ago

Thanks!