UBC-DSCI / introduction-to-datascience-python

Open Source Textbook for DSCI100: Introduction to Data Science in Python
https://python.datasciencebook.ca

Various improvements to predictive chapters #314

Closed trevorcampbell closed 10 months ago

trevorcampbell commented 10 months ago
github-actions[bot] commented 10 months ago

Hello! I've built a preview of your PR so that you can compare it to the current main branch.

joelostblom commented 10 months ago

Nice! I think the added section is clear and will be helpful for students. One thing I would add is a comment on recall and precision compared to the un-tuned classifier, which are actually slightly different (in an undesirable way for recall). I also noticed that we made an error in our computation of recall further up, and that we never show how to use sklearn to get these numbers; we only compute them manually from the confusion matrix. Details with screenshots:

  1. Mix-up of the positive label. In 6.3, we correctly define "Malignant" as our positive label, since that is what we are looking for: [screenshot]

     However, in 6.5.5, we use "Benign" as the positive label when we manually compute precision and recall: [screenshot]

  2. Not showing how to compute precision and recall with sklearn. In 6.5.5, we show how to compute accuracy via .score, but we never show how to compute recall and precision; instead we compute these manually from the confusion matrix: [screenshot]

    I think we could consider changing that paragraph to:

    The output shows that the estimated accuracy of the classifier on the test data was 88%. To compute the precision and recall we can use the following functions from scikit-learn: code cell with precision/recall computation We can see that our precision was .... and our recall was ... . Finally, we can also look at the confusion matrix for the classifier using the crosstab function from pandas. The crosstab function takes two arguments: the actual labels first, then the predicted labels second. The columns and rows are ordered alphabetically, but our positive label is still "Malignant", even if it is not in the top left corner as in the general confusion matrix above.
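To make the suggestion above concrete, here is a minimal sketch of what that code cell could look like. The labels below are made up for illustration (they are not the chapter's cancer test set), but the function calls are the standard scikit-learn and pandas ones: `precision_score` and `recall_score` with `pos_label="Malignant"`, and `pd.crosstab` with the actual labels first.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical test-set labels and predictions; in the chapter these
# would come from the test split and the trained classifier.
y_true = pd.Series(
    ["Malignant", "Malignant", "Benign", "Benign", "Malignant"], name="actual"
)
y_pred = pd.Series(
    ["Malignant", "Benign", "Benign", "Malignant", "Malignant"], name="predicted"
)

# pos_label tells scikit-learn which class counts as "positive";
# matching Section 6.3, that is "Malignant".
precision = precision_score(y_true, y_pred, pos_label="Malignant")
recall = recall_score(y_true, y_pred, pos_label="Malignant")

# Confusion matrix via pandas: actual labels first, then predicted.
# Rows and columns are sorted alphabetically, so "Malignant" is not in
# the top-left corner, but it is still our positive label.
confusion = pd.crosstab(y_true, y_pred)
print(precision, recall)
print(confusion)
```

This keeps the manual confusion-matrix computation available for teaching while also showing students the idiomatic one-liners.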

  3. Comment on precision and/or recall after tuning.

    Before tuning: [screenshot]

    After tuning: [screenshot]

    I think we can add a comment that although accuracy remains similar, our classifier now has slightly worse recall and misses more malignant samples (fewer true positives), briefly remind students why true positives are important in our context, and note that we might want to think more carefully about how to choose our optimal hyperparameters. Optionally, we can also show the sklearn functions for computing the scores here in addition to doing it manually.
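If we do show the sklearn functions here, one compact option is `classification_report`, which prints precision, recall, and F1 for both classes at once and makes the before/after-tuning comparison easy to eyeball. The labels below are hypothetical stand-ins for the test-set truth and the tuned model's predictions:

```python
from sklearn.metrics import classification_report

# Hypothetical labels standing in for the test-set truth and the
# tuned classifier's predictions.
y_true = ["Malignant", "Benign", "Malignant", "Benign", "Malignant"]
y_pred = ["Malignant", "Benign", "Benign", "Benign", "Malignant"]

# One row per class with precision, recall, and F1, plus overall accuracy.
report = classification_report(y_true, y_pred)
print(report)
```

Whether this is worth the extra API surface for students at this point in the book is a judgment call; the per-metric functions above may be gentler.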

trevorcampbell commented 10 months ago

@joelostblom all great comments (and I'm not sure how I missed that benign vs malignant issue in the equations...probably when we did the transpose of the matrix I forgot to adjust those glues too)

Thanks a lot -- will fix those and merge.