UBC-DSCI / introduction-to-datascience-python

Open Source Textbook for DSCI100: Introduction to Data Science in Python
https://python.datasciencebook.ca

Need "Evaluating on the test set" section for classification chapters #280

Closed by ttimbers 10 months ago

ttimbers commented 10 months ago

In the classification chapter, after you do CV to select your parameters, you need to refit/retrain on the entire training data set with the chosen parameter(s) before evaluating on the test data set. In the book we only do this in the regression case, not the classification case!

I think this is a problem: what if someone only reads our classification chapters and misses this? Those two classification chapters are where we set the stage for how to do this correctly.

I think we need a section in the classification chapter like this one: https://datasciencebook.ca/regression1.html#evaluating-on-the-test-set

I think it should go right above here: https://datasciencebook.ca/regression1.html#evaluating-on-the-test-set
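For context, a minimal sketch of the workflow described above (not code from the book; it uses a synthetic dataset and assumes scikit-learn's `KNeighborsClassifier` and `GridSearchCV`):

```python
# Sketch of the cross-validate -> refit -> evaluate-on-test workflow.
# Synthetic data stands in for the book's real datasets.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

# 1. Cross-validate over the training set to choose n_neighbors.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 16, 2)},
    cv=5,
)
grid.fit(X_train, y_train)

# 2. With refit=True (the default), GridSearchCV has already retrained
#    the best model on the *entire* training set.
best_k = grid.best_params_["n_neighbors"]

# 3. Only now do we touch the test set.
test_accuracy = grid.score(X_test, y_test)
```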

ttimbers commented 10 months ago

Note - this also affects the R book.

joelostblom commented 10 months ago

Good call, I agree that we should add a section called "evaluating on the test set" to the classification 2 chapter. However, I don't think we need to change anything in the CV code to refit the model explicitly on all the training data, since the grid search already does this. When we show cross-validation at the beginning of the chapter, we are just describing how it works before we use it with the grid search; we are not predicting anything before tuning the model. I think we can add a section similar to this snippet from the regression chapter to explain what is going on:

To assess how well our model might do at predicting on unseen data, we will assess its RMSPE on the test data. To do this, we first need to retrain the KNN regression model on the entire training data set using 25 neighbors. Fortunately we do not have to do this ourselves manually; scikit-learn does it for us automatically. To make predictions with the best model on the test data, we can use the predict method of the fit GridSearchCV object.

And then show .score (plus maybe recall and precision?). And comment on the meaning of the results.
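As a rough sketch of what that could look like (again not book code; synthetic data, and assuming the tuned `GridSearchCV` object from earlier in the chapter):

```python
# After grid search, evaluate the refit best model on the test set
# with accuracy (.score), plus precision and recall.
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 16, 2)},
    cv=5,
)
grid.fit(X_train, y_train)

# For classifiers, .score reports accuracy; it uses the model that
# GridSearchCV refit on the full training set.
accuracy = grid.score(X_test, y_test)

# Precision and recall need explicit predictions on the test set.
y_pred = grid.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
```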