Suggestions for Ch6 - Classification 2

joelostblom commented 1 year ago

[x] np.random.seed is not recommended anymore
- [x] Instead we shuold create a random state object and use it methods
- [x] However, I don't think we will ever need to use np.random.seed. With both sklearn methods and pandas sample we will always control randomness via the random_state parameter.
- [x] I think it is more confusing than helpful to also talk about seeds here, and we can just show the effect of a different random_state when doing the train_test split instead.
[x] "only function that uses randomness in Python"
[x] We haven't taught how to use a lambda function like this. We could since it is useful and very flexible, or if we think it is too complex, we could teach how to perform the same operation either via cancer["Class"].map({'M': 'Malignant', 'B': 'Benign'}) (which would be convenient but a new thing) or simply filtering: cancer.loc[cancer['Class'] == 'M', 'Class'] = 'Malignant' + the benign version (which is less elegant but builds on what they have learned already)
[x] We are setting the seed in this cell but there is no randomness to control...
[x] We don't need to set the color range
[x] Downgrade to pandas <1.5 to avoid warning until next Altair release FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
[ ] Explain why it is important to use stratify and shuffle, not just mention what they do (include a short example for each in a sentence or so)
[x] This sentence is meant to include the actual number of observations "We can see from .info() in the code above that the training set contains observations, while the test set contains observations"
[x] The .info method always prints columns vertically, not just when there is a large number of them.
[x] We should not use groupby + count here. That creates a count for each column, so it doesn't generalize well. Instead we should use .groupby + size OR just use value_counts(normalize=True) which means we don't have to divide by the length after (btw we should get dataframe length via df.shape[0] not using len (applies to the entire book)). I don't think there is a normalize option to size, but can't double check because I'm in the air
- [x] We also shouldn't create a new dataframe unless we are using it again later.
[x] This sentence is meant to include actual numbers: "we see about % of the training data are benign and % are malignant"
[x] Imports seems to be missing for all sklearn stuff.
[x] Could we just call it "model" instead of "model specification" everywhere? Make it flow better so it is easier to read.
[x] Some introduction to why we use captical X and lower case y?
- [ ] X should be created as cancer_train.drop(columns='Class'), which is more generalizable
- [ ] We could possibly even skip this throughout the book and just pass the dataframe?
  - [ ] If we keep it, we should call it X_train and do the same for the test data.
  - [ ] I think we should probably keep the assignment because we used them a few times later, like when creating X_tune / y_tune in grdisearch, which should just be the same _train instead
[x] Comments about "weights uniform" and "hidden seed"?
[x] Why do we use concat to add the predictions? Can't we just create a copy of the new columns the regular way in the original dataframe (if it is OK to modify at this point)? Or use df.assign(pred=df['predictions']) to just print it without overwriting the variable (if we have taught assign before here.
- [x] We should also not overwrite variable names like this in my opinion.
- [x] Maybe something like this:
```
cancer_test_predictions = cancer_test.copy()
cancer_test_predictions['predicted'] = knn_fit.predict(cancer_test.drop(columns='Class'))
```
[x] X_test is created in the accuracy section but was needed in the previous code cell.
[x] I think we can do a better job explaining accuracy and start by computed it manually using y_test and the predictions created in cancer_test_predictions, so that students get a better grasp of what it is.
[x] Missing actual value in "The output shows that the estimated accuracy of the classifier on the test data was %."
- [x] I just realized that all of these might be fixed when the code actually runs...
[x] Do we ever even need to show confusion_matrix when we are saying that CnfusionMatrixDisiplay is the recommend way?
- [x] This is one of the many places where it feels unnecessarily noise to assign to a variable instead of just displaying the plot.
[x] We need to explain confusion matrices and what is in each square, there are resources in BAIT 509 and DSCI 573 for this.
[x] We talk about "majority classifier" but we never show how to create one. This is really easy to setup in sklearn and it is a useful strategy for students to learn so that they can compare against a baseline. I suggest we show them how to setup a dummy classifier (and a dummy regressor in later chapters). We might also consider renaming "majority classifier" to "dummy classifier" in the text so not be ambiguous
[x] The code in the cross-validation section repeats some of the things that were done earlier in the chapter like the train_test_split
[x] Cross-validation fig72 legend overlaps equation
[x] Quoting the docs for the cv param of cross_validate does not seem that helpful?
[x] The first time we use cross_validate, we do many things at once (cros_validate, reutrn_train, convert to pandas dataframe, set cv to 5 although its the default). It could be better to do these one at a time.
[x] "We can then aggregate the mean and standard deviation". Again Search and replace in the entire chapter.
- [x] We don't need func= in agg
[x] "provides two build-in methods"
[x] I think we should introduce grid search and randomized search one at a time to make it easier to understand and also include more explanation as well as an illustration. We can take this from BAIT 509 https://bait509-ubc.github.io/BAIT509/lectures/lecture6.html#automated-hyperparameter-optimization
[x] Do we cover what range does somewhere or do we just start using it?
[x] "the cross-validation accuracy actually starts to decrease!
[x] param_grid_lots is not a great name, something like large_param_grid sounds better. same with the plots name.
[x] Do we explain n_jobs=-1?
[x] Rewrite this "We can evaluate a classifier by splitting the data randomly into a training and test data set, using the training set to build the classifier, and using the test set to estimate its accuracy. Finally, we can tune the classifier (e.g., select the number of neighbors in K-NN) by maximizing estimated accuracy via cross-validation." to this "We can tune and evaluate a classifier by splitting the data randomly into a training and test data set. The training set is used to build the classifier and we can tune the classifier (e.g., select the number of neighbors in K-NN) by maximizing estimated accuracy via cross-validation. After we have tuned the model we can use the test set to estimate its accuracy."
[x] Swap step 2 and 3, since 2 and 4 belong together and 3 is independent of the two.
[x] Change step 4 to read: "Setup GridSearchCV (or RandomizedSearchCV) to estimate the classifier accuracy for a range of parameter values (values of "K" in our case)."
[x] Change step 5 to read "Execute the grid search by passing the training data to the fit() function of GridSearchCV instance.
[x] "yields a high cross-validation accuracy ~~estimate~~ that"
[x] Relating to step 6 and 7, GridSearchCV already provides us with the best model as an attribute, granted hat we still want them to look at the plot, is there value in teaching how to extract the already fit best model from GridSearchCV.best_estimator_ instead of refitting it (which is redundant and potentially time consuming).
[x] We don't show how we create the irrelevant columns
[x] "Forward selection ~~in Python~~"
- [x] In general I think this section is not that well-written and we should go through it and improve the language as well as how the different code snippets are introduced.

trevorcampbell commented 1 year ago

Downgrade to pandas <1.5 to avoid warning until next Altair release FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.

I'm just going to add code to suppress warnings rather than muck with package versions. We can remove that code later once altair upgrades.

trevorcampbell commented 1 year ago

np.random.seed is not recommended anymore Instead we shuold create a random state object and use it methods

I thought about this for a bit, and tried to implement both ways. I'm going to stick with np.random.seed.

Yes, it does have the drawback that other code may muck with the global seed, and you don't really know whether functions you're using are using the numpy default rng.

BUT for the purposes of students learning about how randomness in computers works as part of a 100-level course, it's just so much simpler and less error prone for students to teach them about setting a global seed.

Given that the drawbacks of np.random.seed only appear in parallel code and/or when using packages outside the course, I'm going to stick with np.random.seed.

That being said, I will add a notebox to tell students about the two caveats I mentioned above, and about RandomState objects.

However, I don't think we will ever need to use np.random.seed. With both sklearn methods and pandas sample we will always control randomness via the random_state parameter.

We shouldn't be setting the random_state manually throughout an analysis. Generally, most common PRNGs (mersenne, xoshiro and variants, pcg, etc) only work properly when they are allowed to run as a long chain.

I think it is more confusing than helpful to also talk about seeds here, and we can just show the effect of a different random_state when doing the train_test split instead.

Since we need to set the seed once at the start of the analysis, we need to teach them about what seeding does.

(They also see it a tonne in their worksheets anyway, so at least it isn't totally foreign to them).

trevorcampbell commented 1 year ago

We haven't taught how to use a lambda function like this. We could since it is useful and very flexible, or if we think it is too complex, we could teach how to perform the same operation either via cancer["Class"].map({'M': 'Malignant', 'B': 'Benign'}) (which would be convenient but a new thing) or simply filtering: cancer.loc[cancer['Class'] == 'M', 'Class'] = 'Malignant' + the benign version (which is less elegant but builds on what they have learned already)

@joelostblom I've been using .replace instead of .map in Ch5 because apparently map puts NaNs when it doesn't find a key -- see example here https://stackoverflow.com/questions/62947285/is-there-a-difference-between-series-replace-and-series-map-in-pandas . I think the default behaviour of replace is nicer (lazy coding :) ). I'll plan to do this in Ch6 too.

trevorcampbell commented 1 year ago

Could we just call it "model" instead of "model specification" everywhere?

Yes, that's a hold over from tidymodels. I'm using the terminology "model object"

trevorcampbell commented 1 year ago

We talk about "majority classifier" but we never show how to create one. This is really easy to setup in sklearn and it is a useful strategy for students to learn so that they can compare against a baseline. I suggest we show them how to setup a dummy classifier (and a dummy regressor in later chapters). We might also consider renaming "majority classifier" to "dummy classifier" in the text so not be ambiguous

I like this idea, but it's more of a "nice to have". I'll open an issue and punt for later.

trevorcampbell commented 1 year ago

Do we ever even need to show confusion_matrix

I'm just going to use pd.crosstab. It's cleaner.

trevorcampbell commented 1 year ago

We don't show how we create the irrelevant columns

We explain how they're created in the paragraph above. I don't think we need to show them the code for this because they shouldn't ever be adding random columns in practice.

trevorcampbell commented 1 year ago

We need to explain confusion matrices and what is in each square, there are resources in BAIT 509 and DSCI 573 for this.

It's good enough for now, punted to later as an issue

trevorcampbell commented 1 year ago

The code in the cross-validation section repeats some of the things that were done earlier in the chapter like the train_test_split

This is where we illustrate what cross val does manually (by manually generating 5 splits). We only show the first split. You'll notice we are splitting the cancer_train data now.

trevorcampbell commented 1 year ago

"Forward selection in Python"

It should be "in ____" because the previous section covers the conceptual stuff. I'll change to to "in scikit-learn", because thankfully it actually implements it, unlike tidymodels in R where we had to code it ourselves.

trevorcampbell commented 1 year ago

"We can then aggregate the mean and standard deviation". Again Search and replace in the entire chapter.

Note, crucially, we want the standard error here, not standard deviation. The sem function should be used. I'm fixing that.

trevorcampbell commented 1 year ago

"yields a high cross-validation accuracy ~~estimate~~ that"

It is an estimate -- I'm keeping that word

trevorcampbell commented 1 year ago

I think we should introduce grid search and randomized search one at a time

I don't think there's any real value in introducing two separate methods for this in this class. I'm just going to introduce grid CV.

trevorcampbell commented 1 year ago

Do we explain n_jobs=-1?

Nope and it isn't necessary; removing

trevorcampbell commented 1 year ago

Relating to step 6 and 7, GridSearchCV already provides us with the best model as an attribute, granted hat we still want them to look at the plot, is there value in teaching how to extract the already fit best model from GridSearchCV.bestestimator instead of refitting it (which is redundant and potentially time consuming).

I think this is a good thing to consider later (nice to have, not necessary right now). Will spin into its own issue.

trevorcampbell commented 1 year ago

In general I think this section is not that well-written and we should go through it and improve the language as well as how the different code snippets are introduced.

I agree. I'm going to comment it out for now entirely and reintroduce it later once we have breathing room. Issue posted.

joelostblom commented 1 year ago

I will start going through Ch6 now, might not finish until tomorrow. A few comments on what you wrote above:

We shouldn't be setting the random_state manually throughout an analysis. Generally, most common PRNGs (mersenne, xoshiro and variants, pcg, etc) only work properly when they are allowed to run as a long chain.

I am not familiar with the internals for how the PRNGSs work, but I think both pandas and sklearn recommend the use of random_state repeatedly over setting seeds globally, so there might be something to account for what you mention in those packages?

I've been using .replace instead of .map...

Great!

It is an estimate -- I'm keeping that word

I agree that the cross-validation is an estimate of the test accuracy. I read that sentence as that we are getting an estimate of the cross-validation accuracy, which we are not, since we are getting the actual cross-validation accuracy. But I'm not a native speaker so if it reads well to you then let's keep it.

UBC-DSCI / introduction-to-datascience-python

Suggestions for Ch6 - Classification 2 #70