Wow, this is a long chapter! Especially with that optional section on feature selection; it's really good though!
[x] "If there were a golden rule of machine learning" -> "If there was a golden rule of machine learning"
[x] Not that important, but we sometimes say "training/testing set" and sometimes "train/test set".
[x] In fig 6.2, the far right version of the test set used to include the class label and have a heading that said "predict class for test set". I think this was clearer than the current version, where it looks like nothing happens in the last set as the test set is the same as in the previous step.
[x] Same comment as previously about table caption being above the table while fig captions are under the tables
[x] For the "accuracy" equation, we don't end it with a period, but for the three equations related to precision and recall, we end them all with a period. We should be consistent across all chapters regarding whether we use a period or not. My personal opinion is that it looks horrible to end an equation with a period, and I would never do that when writing by hand (I know it is often done in digital documents, but I would much rather change the preceding sentence to end in a period or colon).
[x] In the note just before 6.4 we say that never guessing positive is the same as a perfect recall score, but technically it would be an undefined recall score since we are dividing by zero. Then it is implementation specific what score is substituted for that undefined value. E.g. sklearn will input a 0 and warn by default (so lowest possible recall), with an option to input 1s instead. Maybe we should either omit or say something like "almost never" or "is very conservative"?
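    To double-check the sklearn behavior described above, here is a quick sketch (not book code; the toy labels are made up) of the divide-by-zero case and the `zero_division` option:

    ```python
    from sklearn.metrics import recall_score

    # Toy case where recall's denominator (TP + FN) is zero:
    # y_true contains no actual positives, so recall is undefined.
    y_true = [0, 0, 0, 0]
    y_pred = [0, 0, 0, 0]

    # sklearn substitutes 0 by default (and warns); zero_division controls this.
    print(recall_score(y_true, y_pred, zero_division=0))  # 0.0
    print(recall_score(y_true, y_pred, zero_division=1))  # 1.0
    ```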
[x] When we create random numbers for the last time (with seed 4235), we call the variable random_numbers twice, whereas we appended a number above. Maybe we should append a number here too for consistency?
[x] We say "Well, sample is certainly not the only data frame method that uses randomness in Python. Many of the functions that we use in scikit-learn, pandas, and beyond use randomness", but I think sample is literally the only data frame (and pandas) method/function that uses randomness. Maybe change to: "Well, sample is certainly not the only place where randomness is used in Python. For example, many of the functions in scikit-learn use randomness, as we will see in this and later chapters.
[x] Fig 6.3 links to the bottom of the fig instead of the top
[x] For comments in code, we sometimes capitalize them and sometimes not. I think capitalized looks better, but the effort-reward ratio on making this consistent might not be worth it.
[x] In the equations at the end of 6.5.6, we use the word "test set" only in the denominator for recall, which might be confusing since the predictions are also only for the test set. I suggest removing "test set" from the recall equation
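    For reference, one consistent pair without "test set" in either denominator could look like this (my phrasing, not necessarily the book's exact wording):

    ```latex
    \mathrm{precision} = \frac{\text{number of correct positive predictions}}{\text{total number of positive predictions}}
    \qquad
    \mathrm{recall} = \frac{\text{number of correct positive predictions}}{\text{total number of positive observations}}
    ```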
[x] Not that important right now, but in 6.6.1 we say: "If we just split our overall training data once, our best parameter choice will depend strongly on whatever data was lucky enough to end up in the validation set." If I were a student and read that, I would wonder whether that also means our final evaluation depends on luck, since for evaluation we do only make a single split, which is something we never comment on anywhere.
[x] The text of the samples is a bit small in figure 6.4. If there is time, it could be nice to make it bigger, maybe including fewer samples overall if needed.
[x] In 6.6.1 we write "...we use another function: cross_validate. This function requires that we specify a modelling Pipeline as the estimator argument", which makes it sound like only pipelines are accepted, whereas in fact any sklearn estimator is OK even if it is not in a pipeline.
[x] In 6.6.1 we write "The validation scores we are interested in are contained in the test_score column." This might be confusing since we have talked about test data before and that we couldn't use it before evaluation, maybe append a parenthesis "(although the name of this column is test_score, it is using the validation data and not the test data that we have set aside for evaluation.)"
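    A quick sketch supporting both points above (not book code; iris is just a convenient built-in dataset): cross_validate happily takes a bare estimator, and its "test_score" entries are the per-fold validation scores.

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # cross_validate accepts any scikit-learn estimator, not only a Pipeline.
    knn = KNeighborsClassifier(n_neighbors=5)
    scores = cross_validate(knn, X, y, cv=5)

    # Despite the name, "test_score" holds one *validation* score per fold;
    # no separately held-out test set is involved here.
    print(scores["test_score"])
    ```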
[ ] In the text just under fig 5, we say that any value 30 to 80 is acceptable; I think it makes sense to say 30 to 50 instead (note that this figure has changed from the published version). Similar issue in the R version.
[ ] Should the y-axis for fig 6.10 be called something like "Best number of neighbors" for clarity?
[x] In figure 6.12 we mention an elbow but this will be upside down from when we talk about an elbow for clustering. Not that important
[x] The McKinney 2012 ref goes to the ch 3 bibliography instead of this chapter's.
[x] The "james et al" ref in the additional resources goes to the bibliography in ch 8 instead of this chapter.
[x] The RandomState link goes to an old version of the docs. The new one is https://numpy.org/doc/stable/reference/random/legacy.html#numpy.random.RandomState; however, that page mentions explicitly that this is the legacy way of doing things and is no longer recommended, so maybe we should change this note to use a Generator via default_rng instead (https://numpy.org/doc/stable/reference/random/generator.html)?
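    For comparison, here is a minimal sketch of the two interfaces (the seed is arbitrary; same one as elsewhere in the chapter):

    ```python
    import numpy as np

    # Legacy interface the current note links to:
    legacy = np.random.RandomState(4235)
    a = legacy.uniform(size=3)

    # Recommended replacement: a Generator created via default_rng.
    rng = np.random.default_rng(4235)
    b = rng.uniform(size=3)

    # Both are reproducible for a fixed seed, but the two streams differ.
    print(a)
    print(b)
    ```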