@kvarada @hfboyce I have a general comment here that will affect the whole course: I'm proposing we do everything in terms of scores and not introduce the notion of error (1-score) at all. I think that will be easier to understand and will make the code cleaner. What do you think? I realize it will take a bit of work for @hfboyce to redo some of module 3, and not sure about module 2. So we should probably discuss this decision on Monday.
Another general question: did we decide not to teach overfitting on the validation set in this course? I was thinking about the slides on the Golden Rule. These two things kind of go together, don't they?
Module 3 comments:
drop syntax? I think I asked about this earlier. I would expect drop(columns=['country']).
random_state, "fit our models. Validation: used to assess our model during model tuning. Test: unseen data used for a final assessment."
cross_validate: I would show us taking the .mean() of the cross_val_score output, so that they can see us getting a single score. Maybe we can also point out that it's similar to the validation score we saw earlier.
random_state: maybe we should say it's for testing purposes, else they might think there is some ML reason to do this? ;; let's name the variable cv_scores because the CV "score" would normally refer to the average of these sub-scores.
sort_values by the cv score and then take the top entry with iloc[0].
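Pulling those code suggestions together, here's a minimal sketch of the pattern I have in mind (the toy data, the 'country' column, the model, and the k values are placeholders, not the actual course notebook):

```python
# Rough sketch only; data, column name, model, and k values are placeholders.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# placeholder data standing in for the course dataset
X_arr, y_arr = make_classification(n_samples=200, n_features=4, random_state=123)
df = pd.DataFrame(X_arr, columns=['f1', 'f2', 'f3', 'f4'])
df['country'] = y_arr

# drop(columns=...) rather than the positional drop syntax
X = df.drop(columns=['country'])
y = df['country']

# random_state fixed only so the split is reproducible (testing purposes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# name the per-fold results cv_scores; the CV "score" usually means their mean
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
print(cv_scores)          # one sub-score per fold
print(cv_scores.mean())   # a single score, comparable to the validation score shown earlier

# after trying several hyperparameter values, sort by the cv score and take the top row
results = []
for k in [1, 5, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    results.append({'n_neighbors': k, 'mean_cv_score': scores.mean()})
results_df = pd.DataFrame(results)
best = results_df.sort_values(by='mean_cv_score', ascending=False).iloc[0]
print(best)
```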
It seems Module 3 is tough in every course 😳. Well, we knew this would be the hardest one I think. It's the hardest to understand, the hardest to teach, and I'm pretty opinionated about it (sorry). We'll get through it 🚀
I have a general comment here that will affect the whole course: I'm proposing we do everything in terms of scores and not introduce the notion of error (1-score) at all. I think that will be easier to understand and will make the code cleaner. What do you think? I realize it will take a bit of work for @hfboyce to redo some of module 3, and not sure about module 2. So we should probably discuss this decision on Monday.
Using scores instead of error everywhere sounds good to me. I did notice the inconsistency in my notes when I was recording, but then I didn't bother to change it. I guess we could talk about scores in general and mention once that you might hear people talking about error instead of scores, and that in the context of classification, error is just 1 - accuracy score. Using scores instead of error also makes these concepts easier to understand for regression problems, since at this point they have only seen the R2 score.
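For what it's worth, a tiny sketch of that point (the dataset and model here are arbitrary placeholders): .score() already returns accuracy for classifiers (and R2 for regressors), so the error, if we ever mention it, is just 1 minus that.

```python
# Minimal sketch: work with scores everywhere; error is just 1 - score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

score = model.score(X_test, y_test)   # accuracy for classifiers, R^2 for regressors
error = 1 - score                     # only if someone insists on talking about error
print(f"score: {score:.3f}, error: {error:.3f}")
```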
1.15: Can we show a shallower version before showing the deep tree? This is a lot to take in at once.
Like just the first split? I am confused. Do you want me to show a smaller tree, or just part of this large tree? Depth 2 on a new slide.
1.17: Again, I would start by showing some simpler boundaries. I would put the boundary and the tree side-by-side on the same slide for a very simple tree (depth 1 or 2).
So just redo this whole thing but with a tree of depth 2? But now I am confused, because this may generalize better, perhaps?
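For reference, a rough sketch of the "simple tree next to its boundary" idea (the dataset and the two plotted features are placeholders): a depth-2 tree via plot_tree on the left, and its decision boundary on a grid of the two features on the right.

```python
# Sketch: depth-2 tree and its 2D decision boundary side by side (placeholder data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
X2 = X[:, :2]                                  # two features so the boundary is drawable

tree = DecisionTreeClassifier(max_depth=2).fit(X2, y)

fig, (ax_tree, ax_bound) = plt.subplots(1, 2, figsize=(12, 4))
plot_tree(tree, filled=True, ax=ax_tree)       # left: the tree itself

# right: predictions over a mesh grid of the two features
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5, 200),
    np.linspace(X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5, 200),
)
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax_bound.contourf(xx, yy, Z, alpha=0.3)
ax_bound.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor='k')
plt.show()
```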
5.2: Let's start by repeating the earlier diagram, and then having the next slide be this expanded diagram. This expanded diagram is a lot to take in all at once.
Which diagram? Just the train/test split one?
Do they know what a "hyperparameter" is yet? We've used the term a couple times now. I didn't look at Module 2 so I'm not sure, just checking
Yes! I have a section in module 2 about them.
10.3: formatting with $k$. I don't think they have the knowledge to answer this because it's not explained. Maybe we should add this into the slides?
I was thinking this would be said in the part where we show cross_validate, since the running time is there.
For these transcripts, are they copied from my/Varada's notes? I hope you don't spend too much time on them because I'll probably change them when recording. I thought we were going to skip them and then transcribe them after the recording, that's why I'm asking.
Mostly from Varada's notes. I do put some of my own in, just so you have an idea of where I was going with it.
For 13 in general, I wonder if we should just focus on two errors, either train/valid or train/test. I think E_best might be a bit much here. I would say we should either take it out entirely or have it later. Maybe later in the course we might have a section on practical tips, and we can move it there, basically saying you never know if you could have a better model or not.
Discuss in meeting.
18.6: we can clean up the code a lot here. I don't think this is the right time to introduce the error bars (std). I do think it'd be awesome if we introduce these, but they should be in the cross-validation section rather than this section. ;; let's also flip these to use scores instead of errors. ;; the bottom of this plot is cut off for me
Can we expand this question so that they actually produce the plot with Altair (given some starter code we provide)? And then there could be some follow-up questions if we want, including one about the test error maybe, though not required. ;; Also, the plot in 21 is kind of wonky: the cv error goes up and down and then back up. Maybe using more folds might help smooth it out?
Discuss quickly
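For 18.6 / 21, a rough sketch of what the cleaned-up version could look like once it's flipped to scores, with the fold standard deviation shown as a band and more folds to smooth things out (the dataset, model, and k range are placeholders; the Altair part would become the starter code we give them):

```python
# Sketch: mean CV score per hyperparameter value, with a +/- std band, in Altair.
import altair as alt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

rows = []
for k in range(1, 30, 2):
    cv = cross_validate(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    rows.append({
        'n_neighbors': k,
        'mean_cv_score': cv['test_score'].mean(),
        'std_cv_score': cv['test_score'].std(),
    })
results = pd.DataFrame(rows)
results['lower'] = results['mean_cv_score'] - results['std_cv_score']
results['upper'] = results['mean_cv_score'] + results['std_cv_score']

line = alt.Chart(results).mark_line(point=True).encode(
    x='n_neighbors', y=alt.Y('mean_cv_score', scale=alt.Scale(zero=False))
)
band = alt.Chart(results).mark_area(opacity=0.3).encode(
    x='n_neighbors', y='lower', y2='upper'
)
(band + line)   # displays in a notebook; use .save('cv_scores.html') otherwise
```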
Hi @mgelbart! Ok, buckle up for another round of module feedback (also known as finding all of Hayley's grammar mistakes 🤦‍♀️, sorry sorry)!
Link here -> https://intro-machine-learning.netlify.app/en/module3
There should be 22 exercises. The most recent change was the additional Q16 (this should have an image of a decision tree in it, and the question asks if it's more likely underfitting or overfitting).
I am doing the amendments to Assignment 2 today, so I should have Assignment 3 for you in the next 2 days (possibly Friday EOD).
Hope all is going well with juggling both your courses!