Closed — ttimbers closed this issue 3 years ago
Also, please note that I am trying to keep this notepack/textbook as accessible as possible, so my language is often intentionally informal. That said, I am happy to change things if you think it would be better in some cases.
@ttimbers I found the version of Chapter 8 of the book here to be different from that in the Rmd file here, so I worked with the HTML version. Below are some suggestions for Chapter 8. I'll look at Chapter 9 later tonight, or maybe tomorrow (Wed). Congratulations on these notes. They are super valuable, and will be a great resource to have in the Dept, not only for DSCI 100.
8.3 Regression
Mention that (in addition to prediction) regression can also be used to model the relationship between two or more variables, but that here we will focus on prediction.
8.5
"Let’s take a small sample of the data above and walk through ...": I would emphasize a bit more that this subsample is taken only to illustrate the mechanics of K-NN with a few data points, and that we will later use all the data.
Instead of using 2000 sqf as the first example, I would start with 1250 sqf, where you have a few observations either at x = 1250 or almost on it. Then, intuitively, one would say that the price should be around $150K, since the y's are all around that value. We can then suggest taking the average of these values. I say this because for x = 2000 sqf there aren't any observations on x = 2000, and then we need to borrow from neighbours "farther away"; since half of them are noticeably lower and some are noticeably higher, taking their average may not be that intuitive to all the students, whereas if they are all close to each other (as they are for x = 1250), then it may feel more natural to average them.
I would then end the section showing predictions (using 5-NN on the whole data set) for a grid of square footage values, say seq(500, 5000, by=100) or something like that, before moving on to 8.6 to assess these predictions, for example.
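For concreteness, that grid-prediction idea could look something like this in R. This is only a hedged sketch: the `homes` data frame and its `sqft`/`price` columns are made-up stand-ins for the chapter's housing data, and `caret::knnreg()` is just one convenient way to fit K-NN regression.

```r
library(caret)

# Illustrative data; the chapter's real housing data would stand in here
homes <- data.frame(
  sqft  = c(700, 900, 1100, 1250, 1260, 1500, 1800, 2500, 3200, 4100),
  price = c(90, 110, 130, 148, 152, 180, 210, 320, 400, 510) * 1000
)

# Fit 5-NN regression on the whole data set
fit5 <- knnreg(price ~ sqft, data = homes, k = 5)

# Predict over a grid of square footage values
grid <- data.frame(sqft = seq(500, 5000, by = 100))
grid$predicted_price <- predict(fit5, newdata = grid)

head(grid)
```

Plotting `grid$predicted_price` against `grid$sqft` on top of the scatter plot would then show the step-like shape of the K-NN regression "line" before moving on to assessment in 8.6.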
8.6
8.7
8.9.2
@ttimbers Here's the rest of my comments. Once again, congratulations! this is a great set of notes / textbook.
A couple of additional comments on Chapter 8, section 8.6
In the formula for RMSE (or RMSPE), where we say that $\hat{y}_i$ is "the forecasted/predicted value for the i-th observation", would it be possible to say that these need to be computed "independently" of the training set for this to be a proper assessment of future/predictive performance? This is what caret does, but maybe repeat it in words, saying how each $\hat{y}_i$ was computed?
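The point about computing each $\hat{y}_i$ independently of the training set could be made concrete like this. A hedged sketch with simulated data; the split proportions and column names are illustrative, not the chapter's actual code.

```r
library(caret)

# Toy data standing in for the chapter's housing data
set.seed(1)
homes <- data.frame(sqft = runif(100, 500, 4000))
homes$price <- 100 * homes$sqft + rnorm(100, sd = 30000)

# Split so that predictions are computed independently of the training set
in_train  <- createDataPartition(homes$price, p = 0.75, list = FALSE)
train_set <- homes[in_train, ]
test_set  <- homes[-in_train, ]

fit <- knnreg(price ~ sqft, data = train_set, k = 5)

# RMSPE: each y-hat below comes from observations never used in fitting
y_hat <- predict(fit, newdata = test_set)
rmspe <- sqrt(mean((test_set$price - y_hat)^2))
rmspe
```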
Would it be too much information for the students to at least mention that if you were to run 10-fold CV again, the optimal k may be different? Maybe even show them a different run? Does caret set the random seed itself, or can one set it before calling train()?
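On the seed question: caret does not set a seed itself; it draws the CV folds from R's global random number generator, so calling `set.seed()` immediately before `train()` makes the folds (and hence the selected k) reproducible. A minimal sketch, with made-up data and arbitrary seed values:

```r
library(caret)

set.seed(1)
homes <- data.frame(sqft = runif(200, 500, 4000))
homes$price <- 100 * homes$sqft + rnorm(200, sd = 30000)

ctrl <- trainControl(method = "cv", number = 10)
ks   <- data.frame(k = seq(1, 51, by = 10))

# Same seed before each call -> same folds -> same chosen k
set.seed(2019)
fit_a <- train(price ~ sqft, data = homes, method = "knn",
               tuneGrid = ks, trControl = ctrl)

set.seed(2019)
fit_b <- train(price ~ sqft, data = homes, method = "knn",
               tuneGrid = ks, trControl = ctrl)

fit_a$bestTune$k == fit_b$bestTune$k  # TRUE with the same seed
```

Running the same code with two different seeds would show the students that the "optimal" k can change from run to run, which may be a nice way to make the point.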
Chapter 9
9.1
9.4
A suggestion: where you discuss how the best line is chosen, you say "the line that minimzes the vertical distance between itself and each of the observed data points" (typo: note the missing i in "minimzes"). I would stress that this is the average vertical distance between its fitted values and the observed ones (my point here is to mention the idea of minimizing the average discrepancy).
At the end of the section, where it says "We use the same error function that we used with k-nn regression", I'd say "We use the same measure of predictive performance that we used with k-nn regression, which can be computed with the same function in R", etc. My point is to try to separate the concept/quantity from the R implementation (the "function"). But maybe you have used the word "function" to refer to things that are not R functions before, in which case my point is irrelevant...
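The concept-vs-implementation separation above could be illustrated like this: the quantity is RMSPE, and the R code that computes it is identical whether the predictions come from `lm()` or from K-NN. A hedged sketch with simulated data and placeholder object names:

```r
# Toy train/test split standing in for the chapter's housing data
set.seed(3)
train_set <- data.frame(sqft = runif(80, 500, 4000))
train_set$price <- 100 * train_set$sqft + rnorm(80, sd = 30000)
test_set <- data.frame(sqft = runif(20, 500, 4000))
test_set$price <- 100 * test_set$sqft + rnorm(20, sd = 30000)

# Concept: least squares picks the line minimizing the average squared
# vertical distance between fitted and observed values
fit_lm <- lm(price ~ sqft, data = train_set)

# Same quantity (RMSPE) as in k-nn regression, computed the same way in R
y_hat <- predict(fit_lm, newdata = test_set)
rmspe <- sqrt(mean((test_set$price - y_hat)^2))
rmspe
```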
9.5
@msalibian - Thanks for the very helpful feedback! I really appreciate it! Can I please add you as an author/contributor to credit you for your contributions? You also gave much feedback on Melissa's classification chapters, so I think it is well deserved.
I have addressed all the feedback for chapter 9, and the smaller changes for chapter 8. I will keep this issue open and loop back and address the bigger changes you suggest (which I paste below to remind myself of what I have left to do) once I have this week's worksheet and lecture slides done:
Instead of using 2000 sqf as the first example, I would start with 1250 sqf, where you have a few observations either at x = 1250 or almost on it. Then, intuitively, one would say that the price should be around $150K, since the y's are all around that value. We can then suggest taking the average of these values. I say this because for x = 2000 sqf there aren't any observations on x = 2000, and then we need to borrow from neighbours "farther away"; since half of them are noticeably lower and some are noticeably higher, taking their average may not be that intuitive to all the students, whereas if they are all close to each other (as they are for x = 1250), then it may feel more natural to average them.
I would then end the section showing predictions (using 5-NN on the whole data set) for a grid of square footage values, say seq(500, 5000, by=100) or something like that, before moving on to 8.6 to assess these predictions, for example.
8.6
Finish the section by showing the predictions with k = 5, and also those with the optimal k = 51, for the same grid used at the end of 8.5, for example.
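That k = 5 vs. k = 51 comparison could be sketched as below. Illustrative simulated data and object names only; in the chapter this would reuse the housing data and the grid from the end of 8.5.

```r
library(caret)

# Stand-in for the chapter's full housing data set
set.seed(4)
homes <- data.frame(sqft = runif(200, 500, 4000))
homes$price <- 100 * homes$sqft + rnorm(200, sd = 30000)

# Same grid as at the end of 8.5
grid <- data.frame(sqft = seq(500, 5000, by = 100))

# Predictions with the small k and with the CV-selected k
grid$pred_k5  <- predict(knnreg(price ~ sqft, data = homes, k = 5),
                         newdata = grid)
grid$pred_k51 <- predict(knnreg(price ~ sqft, data = homes, k = 51),
                         newdata = grid)
```

Overlaying both prediction curves on the scatter plot would let students see the wiggly k = 5 fit next to the smoother k = 51 one.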
@ttimbers No need to give me credit for these off-the-cuff comments and suggestions!
Thanks again @msalibian for this feedback, most/all of which has been used to improve this chapter!
@msalibian - I am looking for some feedback on the regression chapters for the DSCI 100 course notepack/textbook. I currently have these two chapters drafted:
The students have already read Chapter 8, but I can still make corrections and address them in class. Students will soon read Chapter 9, and the same goes for it: I am happy to address gaps/errors as needed. Comments are welcome in this issue thread, or you can directly edit the following Rmd's: