UBC-DSCI / introduction-to-datascience

Open Source Textbook for DSCI100: Introduction to Data Science in R
https://datasciencebook.ca/

A few suggestions for Chapter 7 #431

Closed Lourenzutti closed 1 year ago

Lourenzutti commented 2 years ago
ttimbers commented 2 years ago

Thanks for the feedback @Lourenzutti ! I will respond to each point inline below:

In STATS, the term multivariate is usually associated with multiple responses, not multiple predictors. I'd suggest changing "Multivariate Regression" to "Multiple Regression"

We don't disagree with you, and this point was also brought to us by the book reviewers. They suggested we use the term multivariable regression (not multivariate, for the reasons you state above). Perhaps we missed fixing this in some instances, however? If you can point to where we use multivariate, we will fix that for sure.

I like the note at the end of Section 7.3, but it can cause confusion with ordinal categorical variables. "Time spent watching TV: none, little, medium, a lot" --> is "little" less than "a lot"? Yes.

Again, I don't disagree with you... Just not sure we want to bring up the idea of ordinality in this book due to the level of the audience... If you have a suggestion of how to tweak the note without bringing in the idea of ordinality, we would be open to improving the note.

The following statement can be a little misleading in my opinion: "One strength of the KNN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships (i.e., if the relationship is not a straight line)." Linear regression can also fit some relationships that are not straight lines. Students might associate linear regression only with straight lines, which is misleading. Maybe a better way of saying it is that KNN makes no assumptions about the form of the relationship?

This is tricky for this book, as we don't really cover how linear regression can be used to fit relationships that are not well represented by straight lines (we only touch on it very modestly here). If you don't like the comment about this in Chapter 7, I think you might be even less happy in Chapter 8, in the section where we compare linear regression and KNN... @trevorcampbell - would love to hear your response here. Do we want to add a note saying that linear regression can be used to model non-linear relationships, but that doing so is beyond the scope of the book?

I agree with "The algorithm really has very few assumptions about what the data must look like for it to work.". Maybe I would just add a quick note that the method can be very data-hungry as the number of predictors increases (to help students see a weakness of the method and justify the usage of simpler models with more stringent assumptions).

In this section we do state:

Weaknesses: K-nearest neighbors regression

  1. becomes very slow as the training data gets larger,
  2. may not perform well with a large number of predictors, and

Do these points not map onto what you are saying? Or am I missing something?

I really like the use of RMSPE vs RMSE: such a simple fix to avoid a lot of confusion!!

Thanks!! Credit for this decision really goes to @msalibian; it was his great suggestion.

trevorcampbell commented 2 years ago

Do we want to add a note saying that linear regression can be used to model non-linear relationships, but that doing so is beyond the scope of the book?

I don't think that adds any value; it just leaves the reader thinking "well, where do I find out about that, then?". We already handle non-linear relationships in about as much detail as I'd want for the level of the book here: https://datasciencebook.ca/regression2.html#designing-new-predictors
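For what it's worth, the "designing new predictors" idea referenced above is easy to demonstrate: a model that is linear in its parameters can still fit a curved relationship once a transformed column (here, x squared) is added to the design matrix. The sketch below is a hypothetical illustration in Python with NumPy only (the book itself uses R/tidymodels, so this is not the book's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
# Simulated curved truth: a quadratic relationship plus noise
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0, 0.5, size=x.size)

# Plain straight-line fit: design matrix [1, x]
X_line = np.column_stack([np.ones_like(x), x])
beta_line, *_ = np.linalg.lstsq(X_line, y, rcond=None)

# "Designed predictor": add x^2 as a new column -- the model is
# still linear in the coefficients, but the fit is now a curve
X_quad = np.column_stack([np.ones_like(x), x, x**2])
beta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

rmse_line = np.sqrt(np.mean((X_line @ beta_line - y) ** 2))
rmse_quad = np.sqrt(np.mean((X_quad @ beta_quad - y) ** 2))
```

On this simulated data the quadratic fit has far lower error than the straight line, despite both being "linear regression" in the sense that matters (linearity in the parameters).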

Lourenzutti commented 2 years ago

Sorry, I didn't mean to discuss how to handle a non-linear relationship with linear regression. My point was about the sentence:

Even a simple model such as linear regression can do that. By saying this, the book conveys the idea that linear models cannot, which I think is misleading. Instead, one could say that KNN can model arbitrary relationships between predictors and response, or that KNN does not make any assumptions about the shape of the relationship, or something along those lines.

Anyway, that was my point. 😄
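To make the "no assumptions about the shape" point concrete: a minimal, hand-rolled KNN regression recovers a sine-shaped relationship that a straight-line fit cannot. This is a hypothetical Python/NumPy sketch (the `knn_predict` helper is made up for this demo; the book itself uses R):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=7):
    # Hand-rolled K-nearest-neighbors regression: predict the
    # average response of the k closest training points.
    preds = []
    for row in X_new:
        dists = np.linalg.norm(X_train - row, axis=1)
        preds.append(y_train[np.argsort(dists)[:k]].mean())
    return np.array(preds)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 300))
y = np.sin(2 * x) + rng.normal(0, 0.2, size=x.size)  # clearly curved truth

X = x.reshape(-1, 1)
rmse_knn = np.sqrt(np.mean((knn_predict(X, y, X) - y) ** 2))

# A straight-line fit for comparison
slope, intercept = np.polyfit(x, y, 1)
rmse_line = np.sqrt(np.mean((slope * x + intercept - y) ** 2))
```

KNN tracks the wiggles of the sine curve without being told anything about its form, while the straight line is left with most of the structure in its residuals.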

trevorcampbell commented 1 year ago

Revisiting this a year later. I'm going to extract the remaining specific "to-do" items and close this issue.

In STATS, the term multivariate is usually associated with multiple responses, not multiple predictors. I'd suggest changing "Multivariate Regression" to "Multiple Regression"

We got rid of all instances of this already.

I like the note at the end of Section 7.3, but it can cause confusion with ordinal categorical variables. "Time spent watching TV: none, little, medium, a lot" --> is "little" less than "a lot"? Yes. Just not sure we want to bring up the idea of ordinality in this book due to the level of the audience... If you have a suggestion of how to tweak the note without bringing in the idea of ordinality, we would be open to improving the note.

I also agree, but can't think of a simple way of doing it without introducing ordinality and discussing differences with other categorical stuff. Just a kettle of fish we don't want to handle :-)

The following statement can be a little misleading in my opinion: "One strength of the KNN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships (i.e., if the relationship is not a straight line)." Linear regression can also fit some relationships that are not straight lines. Students might associate linear regression only with straight lines, which is misleading. Maybe a better way of saying it is that KNN makes no assumptions about the form of the relationship?

Agree with your sentiment, but this isn't the right place to put this sort of remark, because we haven't introduced linear regression yet. Actually, I think we should put something like that in Section 8.3, where we compare KNN and linear regression (as Tiffany pointed out). Or maybe where we present the equation of the line, we could note that it's linearity in the parameters that matters, not in the variables. Anyway, I'll open a separate issue for that.

Maybe I would just add a quick note that the method can be very data-hungry as the number of predictors increases (to help students see a weakness of the method and justify the usage of simpler models with more stringent assumptions).

The things Tiffany pointed out sort of address this, but not fully. On the other hand, the sample complexity of various algorithms is really tough to discuss at a first-year level beyond what we already do ("doesn't work well with a lot of predictors, doesn't extrapolate well, etc."). I don't think it's worth addressing in our book beyond what's there now.
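As a small illustration of the "data-hungry with many predictors" point discussed above: in the hypothetical Python/NumPy sketch below (the `knn_rmspe` and `error_with_noise_dims` helpers are made up for this demo; the book itself uses R), adding irrelevant predictors noticeably inflates KNN's held-out error (RMSPE), because distances become dominated by the noise dimensions:

```python
import numpy as np

def knn_rmspe(X_tr, y_tr, X_te, y_te, k=5):
    # Hand-rolled KNN regression, evaluated on held-out data (RMSPE).
    preds = []
    for row in X_te:
        dists = np.linalg.norm(X_tr - row, axis=1)
        preds.append(y_tr[np.argsort(dists)[:k]].mean())
    return np.sqrt(np.mean((np.array(preds) - y_te) ** 2))

rng = np.random.default_rng(2)
n = 400
x1 = rng.uniform(-3, 3, n)
y = np.sin(2 * x1) + rng.normal(0, 0.2, n)  # only x1 matters
train = np.arange(n) < 300  # first 300 rows train, last 100 test

def error_with_noise_dims(d):
    # Append d irrelevant predictors and measure held-out error.
    X = np.column_stack([x1] + [rng.uniform(-3, 3, n) for _ in range(d)])
    return knn_rmspe(X[train], y[train], X[~train], y[~train])

err_1_predictor = error_with_noise_dims(0)
err_21_predictors = error_with_noise_dims(20)
```

With one informative predictor the error sits near the noise level; with 20 irrelevant columns added, the nearest neighbors are essentially random in the dimension that matters and the error jumps, which is the weakness the book's "may not perform well with a large number of predictors" bullet points at.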