cis-ds / Discussion

General question on cross validation #210

Open BaichenTan opened 2 years ago

BaichenTan commented 2 years ago

I am a little confused about cross-validation. In the lecture, we did not use a for loop because we only want to use the valuable testing data once, so we used cross-validation instead. However, in the class exercises, aren't we using cross-validation only within the training observations? That is, we first divide the whole dataset into testing and training sets, and then divide the training set again into folds of analysis sets and assessment sets. We can then use the folds to compute 10 different estimates and summarize them to get the average value. But aren't all of those calculations still done on the training dataset, without ever using the testing dataset?

For example, in the class exercise, why do we use data = bechdel_training instead of data = bechdel in the vfold_cv() function? In Turn 5 we don't really use the testing data. My interpretation of cross-validation is that we first subdivide the training set of bechdel into 10 folds, calculate the average model, and then apply the average model to the training bechdel. But it seems that here we are only using the training set to find the estimate. So if we want to incorporate the testing bechdel in Turn 5, what should we do?

On the other hand, can we directly use data = bechdel in vfold_cv(), so that the folds are based on the whole dataset instead of just the training dataset when we run fit_resamples()?

bensoltoff commented 2 years ago

We only want to use bechdel_test one time - once we have decided on the best model and want to generate predictions from it. If we use it multiple times, we will generate biased estimates of our models' performance.

By applying CV only to the training set, we still get the benefits of resampling without using the test set more than once.
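
A minimal sketch of that workflow, assuming the tidymodels packages are installed. The simulated `movies` data and every `movies_*` name are hypothetical stand-ins for `bechdel`, not the course dataset. Note that the folds yield an averaged performance estimate rather than an "average model", and that the test set appears only in the final `last_fit()` call.

```r
library(tidymodels)

set.seed(123)

# Simulated stand-in for `bechdel` (hypothetical data, not the course dataset)
n <- 500
movies <- tibble(
  x = rnorm(n),
  y = factor(ifelse(runif(n) < plogis(1.5 * x), "Pass", "Fail"))
)

# 1. Hold out the test set once, up front
movies_split <- initial_split(movies, prop = 0.75)
movies_train <- training(movies_split)
movies_test <- testing(movies_split)

# 2. Build the 10 folds from the training set only
movies_folds <- vfold_cv(movies_train, v = 10)

# 3. Estimate performance by resampling; the test set is not touched here
movies_wf <- workflow() %>%
  add_formula(y ~ x) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

cv_metrics <- movies_wf %>%
  fit_resamples(resamples = movies_folds) %>%
  collect_metrics() # metrics averaged across the 10 assessment sets

# 4. Once the model is chosen: refit on all training rows and
#    evaluate on the test set a single time
final_fit <- last_fit(movies_wf, split = movies_split)
test_metrics <- collect_metrics(final_fit)
```

Passing the full dataset to `vfold_cv()` instead would put the would-be test rows into the analysis and assessment sets, so no untouched data would remain for the final check in step 4.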