Closed aef1004 closed 4 years ago
@aef1004 : Great start on this! I have some suggestions below — mostly small edits to the text, or places where it would help to add an explanation or a bit more discussion. As long as we leave the pull request open, any changes that you make and push to your own GitHub repo for this should come straight through this pull request.
- When you split the data into training and test sets, you can use the `sample` function to sample from `TRUE` and `FALSE` at the proportions that you want.
- It would be worth noting that the outcome (`lpsa`) is a continuous variable, and maybe even showing that with a summary of the variable or something like a histogram (which would show the range of the observations for that value, as well as give a general idea of whether those measures follow a normal-ish distribution). You might also find the `GGally` package useful for this kind of exploratory plotting.
- It would help to explain how `alpha` in `glmnet` translates to the lasso versus ridge penalties. It might be helpful to add a sentence or two here about why you might want to add a penalty like these when you're fitting a regression model with the goal of using it to predict an outcome. Also, the use of `alpha` as the R function parameter in this function is consistent with the nomenclature in the equation given in the book in the paragraph right under equation 12.8 (p. 327 in the text version)---the one for pen(beta), which expresses the penalty function in terms of alpha and beta for the elastic net framework. You'll see that if alpha goes to 0, you're left with just |beta|^1 (which, if you look earlier in that paragraph, is the lasso penalty), while if alpha goes to 1, you're left with just |beta|^2 (which is the ridge penalty). It might be worth adding just a bit of text about this when you introduce the idea of shifting `alpha` in `glmnet` to get ridge and lasso penalties (and maybe even put in that equation from the book for the penalty function in the elastic net!).
- For the first `glmnet` call, I think it might be worthwhile to assign the predictors and outcome to their own R objects before running the `glmnet` calls. Also, I think it would be worthwhile to show the first few rows of the predictors, so that readers can refer back to it to match the label numbers in the plots to the order of the variables. Here's how an update to the code to do that might look:

  ```r
  predictors <- prostate_train %>%
    dplyr::select(lcavol:pgg45) %>%
    as.matrix()

  head(predictors)

  target <- prostate_train %>%
    pull(lpsa)

  glmnet(x = predictors,
         y = target,
         family = "gaussian", alpha = 1)
  ```
- Nice labeled plots of the `glmnet` outputs! Could you add a note, either in code comments right in the code or in the text, about how you're using the `label = TRUE` option and the `title` function to do this? Could you also add a short sentence when you describe these plots explaining that the numbers by each line label which variable is being shown, with the number corresponding to the order of that variable in the matrix of predictors input to `glmnet` (so, "1" = the first column in the predictors, `lcavol`, and so on)? You almost get there in the current version, but I do think the small logical step of "getting" that the numbers correspond to the order of the predictor columns in the original predictor matrix might not be immediately clear to everyone without a little explanation.
- Where the text mentions values like `alpha`, you could instead use backticks to make it clearer that these are values in code.
- The cross-validation and the evaluation of the final models in the `test` dataset are really nice additions! Maybe add just a bit in these sections (or at the end of the cross-validation one) explaining that, at the end of the cross-validation process, we have one "best" predictive model, which uses the "best" value of lambda based on what we found out from the cross-validation (is this described correctly?). Then it's easier for the reader to follow the idea that we now have a "final" model for both lasso and ridge, and can go back and see how well those models do in the test dataset, which we've held out until now and not used at all in building these models.
- For the plots comparing actual and predicted values, I'd suggest adding `coord_fixed`, along with `xlim` and `ylim` values that force the axis ranges to be the same for the actual and predicted measurements of each value. For example, for the ridge regression section, the modified code might look like:

  ```r
  ggplot(actual_ridge_predict_df, aes(x = actual, y = prediction)) +
    geom_point(color = "#FF62BC", size = 2) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Ridge Prediction") +
    theme_light() +
    coord_fixed(xlim = c(0.75, 5.6),
                ylim = c(0.75, 5.6))
  ```
  Since an ideal model would place every point exactly on the line where actual and predicted values are equal, it helps to draw this comparison plot in a way that shows how the real data vary from that ideal. By forcing the two scales to the same range, the "equals" diagonal cuts straight from the bottom left to the top right, making it easy to see regions where the model tends to over- or under-predict.
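To make the `sample` suggestion above concrete, here's a minimal hypothetical sketch of drawing a `TRUE`/`FALSE` split indicator (the `n`, the proportions, and the object names are illustrative, not taken from the post):

```r
# Hypothetical sketch: build a train/test indicator by sampling TRUE/FALSE
# at the proportions you want (here roughly 80% train, 20% test).
set.seed(42)                              # make the split reproducible
n <- 97                                   # illustrative number of rows
in_train <- sample(c(TRUE, FALSE), size = n, replace = TRUE,
                   prob = c(0.8, 0.2))
table(in_train)                           # check the realized proportions
# train <- prostate[in_train, ]; test <- prostate[!in_train, ]
```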
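On the elastic-net point above, a tiny numeric sketch can show how shifting alpha moves between the two penalties (the `pen` helper is hypothetical, written to follow the parameterization described above):

```r
# Hypothetical helper for the elastic-net penalty as described above:
# pen(beta) = sum(alpha * beta^2 + (1 - alpha) * abs(beta)),
# so alpha = 0 leaves only the |beta| (lasso) term and
# alpha = 1 leaves only the beta^2 (ridge) term.
pen <- function(beta, alpha) {
  sum(alpha * beta^2 + (1 - alpha) * abs(beta))
}

beta <- c(0.5, -1.2, 2.0)
pen(beta, alpha = 0)   # lasso term only: sum(abs(beta)) = 3.7
pen(beta, alpha = 1)   # ridge term only: sum(beta^2) = 5.69
```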
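And on the cross-validation point above, the "best lambda, then final model" flow might be sketched like this (synthetic data stands in for the prostate objects, and the fitting only runs if the glmnet package is installed):

```r
# Synthetic stand-ins for the predictors/target objects built in the post
set.seed(1)
predictors <- matrix(rnorm(100 * 8), ncol = 8,
                     dimnames = list(NULL, paste0("x", 1:8)))
target <- rnorm(100)

if (requireNamespace("glmnet", quietly = TRUE)) {
  # Cross-validate over lambda for the lasso (alpha = 1) ...
  cv_fit <- glmnet::cv.glmnet(x = predictors, y = target,
                              family = "gaussian", alpha = 1)
  best_lambda <- cv_fit$lambda.min   # lambda with the lowest CV error
  # ... then refit one "final" model at that lambda
  final_fit <- glmnet::glmnet(x = predictors, y = target,
                              family = "gaussian", alpha = 1,
                              lambda = best_lambda)
  # Predictions on the held-out test data would then use:
  # predict(final_fit, newx = test_predictors)
}
```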
@aef1004 : I see that you just pushed several commits with edits a few hours ago (and thank you for making those edits so fast!). Is this now ready for me to merge in and publish on the site? Since you did this in multiple commits, I wanted to make sure that you're set and not merge too soon.
@geanders I just finished with a final push, so it should be ready to merge and publish now!
@aef1004 Excellent! It's now published: https://kind-neumann-789611.netlify.app/post/exercise-solution-for-chapter-12/
Here is the Chapter 12 exercise. Let me know if I should make any additional changes!