hfboyce opened this issue 3 years ago
- 4: have they already seen `make_scorer`?
- `model.intercept_`. `Ridge` is not a classifier.
- `lr.coef_[0]` instead of `lr.coef_`.
- `lr.classes_` so we can see the order, i.e. that Canada is negative and US is positive in the model's brain 🧠
- `CountVectorizer` was used and so each column is a word, and since each coef is a column we now have one coef per word? And also the fact that the targets are positive or negative sentiment here. Maybe that would be worth adding as a last slide on the previous slide deck so there could be video footage introducing this? Or is text enough in 10? I was also caught a bit off guard by the text itself. Maybe you could say "We have the following text, which we wish to classify as either positive or negative." When this is done let's touch base again so I can check that the context is sufficient.
- 15: have we defined "positive class" explicitly? Maybe this can be mentioned when we add in the `classes_` I mentioned earlier.

Will review 16-20 soon.
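A minimal sketch of the `intercept_` / `coef_[0]` / `classes_` point above, on a made-up two-class mini-corpus rather than the course data:

```python
# Made-up mini-corpus, just to show the shapes and the class order.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

X_text = ["maple syrup and hockey", "baseball and thanksgiving",
          "tim hortons and maple syrup", "baseball and apple pie"]
y = ["Canada", "US", "Canada", "US"]

vec = CountVectorizer()
X = vec.fit_transform(X_text)            # one column per word
lr = LogisticRegression(max_iter=1000).fit(X, y)

print(lr.classes_)      # ['Canada' 'US'] -> US is the positive class, Canada the negative
print(lr.intercept_)    # shape (1,) for a binary problem
print(lr.coef_.shape)   # (1, n_words), so lr.coef_[0] gives one coefficient per word
print(dict(zip(vec.get_feature_names_out(), lr.coef_[0])))
```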
- `pkm_grid` even though it's random search?
- `max_iter=1000` in the log reg. May be worth having them do this.
- `draft_round` will increase the probability of Guard, but it may or may not increase the probability of Forward!! Because in fact if `draft_round` is huge then the probability will go to 99.9999 for Guard and the probability for Forward will be decreasing. So we have to be quite careful here with our interpretations. I will also add this to my CPSC 330 course. So, TLDR here, this wording should hopefully be safe: "For which feature does increasing its value always push the prediction away from the Other class?"

> 1.5: it would have been nice to also show an example with lots of features where alpha=0 is not the best choice - maybe one of the previous regression datasets they've seen already?
Ok, this was a problem. I tried a bunch of different datasets using a single feature, but the problem I was encountering was that it did not plot very nicely. So I guess my question here is what is more important: a plot that is easy to understand, or a best alpha that isn't the lowest value? Of this hyperparameter tuning and the logistic regression one, it's easier to change this one (since we use the logistic regression dataset for quite a few slide decks), but would you be ok sacrificing the plot for it?
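Going back to the `draft_round` comment above: a tiny numeric sketch of the softmax behaviour, with completely made-up per-class coefficients (not the course model). Guard's probability climbs monotonically, while Forward's first rises and then falls.

```python
import numpy as np

# Hypothetical per-class weights on draft_round (not taken from the slides).
coefs = {"Guard": 2.0, "Forward": 1.0, "Other": -4.0}

def softmax_probs(draft_round):
    scores = np.array(list(coefs.values())) * draft_round
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    return dict(zip(coefs, exp / exp.sum()))

for x in [0.0, 0.5, 1.0, 2.0, 5.0]:
    print(x, {k: round(float(v), 3) for k, v in softmax_probs(x).items()})
# Forward goes roughly 0.333 -> 0.37 -> back down toward 0,
# even though its own coefficient is positive; Guard heads toward 1.
```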
> 4: have they already seen `make_scorer`?
Yep! I talk about it in Module 7!
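For reference, a generic `make_scorer` sketch (not the Module 7 example), just to show the pattern of wrapping a metric for `cross_validate`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

# Wrap MAE so cross_validate can use it; scores come back negated
# because greater_is_better=False.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
scores = cross_validate(Ridge(), X, y, scoring=mae_scorer, cv=5)
print(scores["test_score"])
```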
> 5.8: I'm hesitant about this because the features are not scaled. Remove this slide? UPDATE: this shows up in 12.1. Ok so maybe we need to keep this in but add a cautionary note that it depends on the scaling of the features, because larger features will have smaller coefficients, but if we scale then they are kind of on a level playing field.
I added this in the transcript notes of the same slide, is that ok?
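One way to make that cautionary note concrete, if it helps, is a sketch like this (synthetic data, not the slide's dataset): put `StandardScaler` in front of the model so the coefficients end up on a level playing field.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=42)
X[:, 0] *= 1000   # blow up one feature's scale

# Unscaled: the inflated feature gets a correspondingly tiny coefficient.
print(Ridge().fit(X, y).coef_)

# Scaled: coefficient magnitudes become comparable across features.
pipe = make_pipeline(StandardScaler(), Ridge()).fit(X, y)
print(pipe.named_steps["ridge"].coef_)
```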
> 5.9: let's just call them weights.
Wait wait, I thought we were calling them coefficients?
> any chance someone will get confused and think 9.172344e+06 is not the same as 9172344.01129167?
We explained this in the first module of PPDS (a practice problem)
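If it's worth a one-liner on a slide, something like this shows it's the same number, just printed in scientific notation with fewer digits:

```python
x = 9172344.01129167
print(f"{x:.6e}")   # 9.172344e+06  (scientific notation, rounded)
print(f"{x:.8f}")   # 9172344.01129167
```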
> 9.9: again, kind of sad that the lowest regularization model does best. Let's see if we can get a good example for at least one of the two cases (Ridge or LogisticRegression) where it isn't the case?
See 1.5 above. This was also a bit of a mess. I found it hard to find a good way of showing the values actually changing somewhat reasonably.
> 15: have we defined "positive class" explicitly? Maybe this can be mentioned when we add in the `classes_` I mentioned earlier.
We did heavily in Module 7!
> 17.9: x-axis label cut off
But I have so much room at the bottom of mine! How big is your screen? This is mine at 100%.
Ok, the majority of the changes are done. I just need to figure out Question 10 and add a slide in deck 9; I will do that in the morning. In the meantime, I can pass the exercises to Elijah to make tests for, so I can review them before Friday.
> Ok, this was a problem. I tried a bunch of different datasets using a single feature, but the problem I was encountering was that it did not plot very nicely. So I guess my question here is what is more important: a plot that is easy to understand, or a best alpha that isn't the lowest value? Of this hyperparameter tuning and the logistic regression one, it's easier to change this one (since we use the logistic regression dataset for quite a few slide decks), but would you be ok sacrificing the plot for it?
Yeah, I don't think a single feature would work - basically any dataset with a single feature will probably prefer alpha=0 as the best. Maybe here we should consider deviating from our rule of fewer datasets and have separate datasets for the plot vs. the alpha tuning?
Yeah, it's fine to do it here and leave the logistic regression alone.
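A rough sketch of what that separate alpha-tuning dataset could look like (synthetic data via `make_regression`, purely illustrative): with many noisy features and relatively few rows, the best `alpha` usually isn't the smallest one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# More features than samples, most of them uninformative.
X, y = make_regression(n_samples=100, n_features=200, n_informative=10,
                       noise=20, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=5).fit(X, y)
print(search.best_params_)   # typically a mid-range alpha rather than the smallest
```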
> I added this in the transcript notes of the same slide, is that ok?
Yeah sounds good.
> Wait wait, I thought we were calling them coefficients?
I have no idea what I was talking about. Please disregard.
> But I have so much room at the bottom of mine! How big is your screen? This is mine at 100%.
I think it's that I have a very high-resolution screen. Anyway, this is what I see. There's a lot of whitespace at the top above the picture that could be removed. Sometimes with matplotlib, calling `plt.tight_layout()` before saving helps with this - not sure if that would help here.
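For what it's worth, the usual fix looks something like this (generic figure, not the slide's plot):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
ax.set_xlabel("a fairly long x-axis label that might otherwise get clipped")

fig.tight_layout()                             # trims extra margins
fig.savefig("plot.png", bbox_inches="tight")   # also keeps labels from being cut off
```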
@mgelbart Here we go!
First round coming in HOT 🔥! As I said, this is one of my least happy modules and it feels a little choppy to me.
I am ready for some feedback to fix it up.
You can find it here -> https://ml-learn.mds.ubc.ca/en/module8
Assignment coming tomorrow/Monday.