UBC-DSCI / introduction-to-datascience

Open Source Textbook for DSCI100: Introduction to Data Science in R
https://datasciencebook.ca/

another classifier in classification chapter? #232

Open leem44 opened 2 years ago

leem44 commented 2 years ago

Posting #106 Reviewer E's comment here about classification chapter:

trevorcampbell commented 2 years ago

I'll make this a v2 enhancement, with the goal of adding logistic regression. But in reality this is somewhere between "v2 enhancement" and "blue sky enhancement".

trevorcampbell commented 11 months ago

Just a bit of brainstorming on this as a follow-up: if we do this, I think it might make the most sense to have a new chapter, Classification III, that covers logistic regression. That way we can mimic the structure of Regression II, which is the equivalent chapter for regression. If we tried to do LogReg in Classification I or II, we wouldn't be able to do that (b/c some of the concepts it needs wouldn't have been introduced yet).

This would also involve editing Reg II to avoid repeating ourselves when we get to Lin Reg.

ttimbers commented 11 months ago

I worry a bit about introducing logistic regression before linear regression... and we use knn classification as a gateway to knn regression, so we're kind of tied to classification and then regression...

Maybe all this wouldn't be a problem if we place Classification II after Regression II? So it's kind of like a classification sandwich? Alternatively, we choose some other algorithm for classification II? Decision trees could be good? They're the basis for some of the most popular and best performing ML models right now? Or we could choose SVMs?

trevorcampbell commented 11 months ago

Thanks for brainstorming :)

> Decision trees could be good? They're the basis for some of the most popular and best performing ML models right now? Or we could choose SVMs?

I'm definitely on board with adding more interesting classification models like decision trees / forests / SVMs / NNs.

If I had to pick more classification models to add, I'd go with LogReg first (because it's almost as simple as linear regression, very popular, and a nice counterpart to the uninterpretable knn stuff) and then Decision tree/forest because it has a really nice algorithmic / intuitive description of how it classifies things. SVMs and NNs are harder to introduce at the level of this textbook -- esp SVMs... -- but I don't think it's impossible.

> Maybe all this wouldn't be a problem if we place Classification II after Regression II?

Hmmm, I don't think that will work -- that would cause a huge rewrite of at least 3 chapters -- since Reg 1 & 2 rely on knowing about cross val / tuning / etc from Cls 2.

> Alternatively, we choose some other algorithm for classification II?

For me the purpose of Cls 2 is mostly to introduce evaluation / tuning. I wouldn't want to introduce a new classifier at the same time, just to avoid overloading people. That would also involve fairly heavy editing on an already polished chapter.

> and we use knn classification as a gateway to knn regression, so we're kind of tied to classification and then regression...

I don't think it's super important to jump directly from knn classification to knn regression. We already space them out by Cls 2, which is all about tuning/eval. If we had a new "classification 3", at the beginning of "regression 1" we would just keep the same introduction to regression problems, and make very minor modifications to the text to say that we're going to introduce regression with a k-nn-based model, just like we did in classification.

I'm still fairly convinced that the most natural place to introduce new classifiers is in a new "Classification 3" chapter. It also makes it natural to later on consider adding other regression models in "Regression 3", but we could merge those into Reg 2 as well.

trevorcampbell commented 11 months ago

Just documenting one point from an in-person chat with Tiffany: we probably want to avoid introducing new classifiers in the actual DSCI100 course itself, to avoid conflict with other existing classes (CPSC330 notably). But adding to the textbook can be independent of that.

One other potential issue with a "Cls 3" chapter: introducing logistic regression before linear regression will be a bit awkward. Maybe best to stick with decision tree/forest?

Probably will punt this edit for now and return to it later.