Closed aef1004 closed 4 years ago
@aef1004 : Great start on this! I have some suggestions below — mostly small edits to the text, or places where it would help to add an explanation or a bit more discussion. As long as we leave the pull request open, any changes that you make and push to your own GitHub repo for this should come straight through this pull request.
- When you split the data into training and test sets, you can use the `sample` function to sample from `TRUE` and `FALSE` at the proportions that you want.
- It would be worth noting that the outcome (`lpsa`) is a continuous variable, and maybe even showing that with a summary of the variable or something like a histogram (which would show the range of the observations for that value, as well as give a general idea of whether those measures follow a normal-ish distribution). You might also find the `GGally` package useful for this kind of exploratory plotting.
- It would help to explain how `alpha` in `glmnet` translates to the lasso versus ridge penalties. It might be helpful to add a sentence or two here about why you might want to add a penalty like these when you're fitting a regression model with the goal of using it to predict an outcome. Also, the use of `alpha` as the R function parameter in this function is consistent with the nomenclature in the equation given in the book in the paragraph right under equation 12.8 (p. 327 in the text version)---the one for pen(beta), which expresses the penalty function in terms of alpha and beta for the elastic net framework. You'll see that if alpha goes to 0, you're left with just |beta|^1 (which, if you look earlier in that paragraph, is the lasso penalty), while if alpha goes to 1, you're left with just |beta|^2 (which is the ridge penalty). It might be worth adding just a bit of text about this when you introduce the idea of shifting `alpha` in `glmnet` to get ridge and lasso penalties (and maybe even put in that equation from the book for the penalty function in the elastic net!).
- For the first `glmnet` call, I think it might be worthwhile to assign the predictors and outcome to their own R objects before running the `glmnet` calls. Also, I think it would be worthwhile to show the first few rows of the predictors, so that readers can refer back to it to match the label numbers in the plots to the order of the variables. Here's how an update to the code to do that might look:

  ```r
  predictors <- prostate_train %>%
    dplyr::select(lcavol:pgg45) %>%
    as.matrix()

  head(predictors)

  target <- prostate_train %>%
    pull(lpsa)

  glmnet(x = predictors,
         y = target,
         family = "gaussian", alpha = 1)
  ```
- Nice labeled plots of the `glmnet` outputs! Could you add a note, either in code comments right in the code or in the text, about how you're using the `label = TRUE` option and the `title` function to do this? Could you also add a short sentence when you describe these plots explaining that the numbers by each line label which variable is being shown, with the number corresponding to the order of that variable in the matrix of predictors input to `glmnet` (so, "1" = the first column in the predictors, `lcavol`, and so on)? You almost get there in the current version, but I do think the small logical step of "getting" that the numbers correspond to the order of the predictor columns in the original predictor matrix might not be immediately clear to everyone without a little explanation.
- Where the text mentions values like `alpha`, you could instead use backticks to make it clearer that these are values in code.
- The cross-validation and the evaluation of the final models in the `test` dataset are really nice additions! Maybe add just a bit in these sections (or at the end of the cross-validation one) explaining that, at the end of the cross-validation process, we have one "best" predictive model, which uses the "best" value of lambda based on what we found out from the cross-validation (is this described correctly?). Then it's easier for the reader to follow the idea that we now have a "final" model for both lasso and ridge, and can go back and see how well those models do in the test dataset, which we've held out until now and not used at all in building these models.
- For the plots comparing actual and predicted values, I'd suggest adding `coord_fixed`, along with `xlim` and `ylim` values that force the axis ranges to be the same for the actual and predicted measurements of each value. For example, for the ridge regression section, the modified code might look like:

  ```r
  ggplot(actual_ridge_predict_df, aes(x = actual, y = prediction)) +
    geom_point(color = "#FF62BC", size = 2) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Ridge Prediction") +
    theme_light() +
    coord_fixed(xlim = c(0.75, 5.6),
                ylim = c(0.75, 5.6))
  ```
  Since an ideal model would place every point exactly on the line where actual and predicted values are equal, it helps to draw this comparison plot in a way that shows how the real data vary from that ideal. By forcing the two scales to the same range, the "equals" diagonal cuts straight from the bottom left to the top right, making it easy to see regions where the model tends to over- or under-predict.
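To make the `sample` suggestion above concrete, here's a minimal hypothetical sketch of drawing a `TRUE`/`FALSE` split indicator (the `n`, the proportions, and the object names are illustrative, not taken from the post):

```r
# Hypothetical sketch: build a train/test indicator by sampling TRUE/FALSE
# at the proportions you want (here roughly 80% train, 20% test).
set.seed(42)                              # make the split reproducible
n <- 97                                   # illustrative number of rows
in_train <- sample(c(TRUE, FALSE), size = n, replace = TRUE,
                   prob = c(0.8, 0.2))
table(in_train)                           # check the realized proportions
# train <- prostate[in_train, ]; test <- prostate[!in_train, ]
```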
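On the elastic-net point above, a tiny numeric sketch can show how shifting alpha moves between the two penalties (the `pen` helper is hypothetical, written to follow the parameterization described above):

```r
# Hypothetical helper for the elastic-net penalty as described above:
# pen(beta) = sum(alpha * beta^2 + (1 - alpha) * abs(beta)),
# so alpha = 0 leaves only the |beta| (lasso) term and
# alpha = 1 leaves only the beta^2 (ridge) term.
pen <- function(beta, alpha) {
  sum(alpha * beta^2 + (1 - alpha) * abs(beta))
}

beta <- c(0.5, -1.2, 2.0)
pen(beta, alpha = 0)   # lasso term only: sum(abs(beta)) = 3.7
pen(beta, alpha = 1)   # ridge term only: sum(beta^2) = 5.69
```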
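And on the cross-validation point above, the "best lambda, then final model" flow might be sketched like this (synthetic data stands in for the prostate objects, and the fitting only runs if the glmnet package is installed):

```r
# Synthetic stand-ins for the predictors/target objects built in the post
set.seed(1)
predictors <- matrix(rnorm(100 * 8), ncol = 8,
                     dimnames = list(NULL, paste0("x", 1:8)))
target <- rnorm(100)

if (requireNamespace("glmnet", quietly = TRUE)) {
  # Cross-validate over lambda for the lasso (alpha = 1) ...
  cv_fit <- glmnet::cv.glmnet(x = predictors, y = target,
                              family = "gaussian", alpha = 1)
  best_lambda <- cv_fit$lambda.min   # lambda with the lowest CV error
  # ... then refit one "final" model at that lambda
  final_fit <- glmnet::glmnet(x = predictors, y = target,
                              family = "gaussian", alpha = 1,
                              lambda = best_lambda)
  # Predictions on the held-out test data would then use:
  # predict(final_fit, newx = test_predictors)
}
```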
@aef1004 : I see that you just pushed several commits with edits a few hours ago (and thank you for making those edits so fast!). Is this now ready for me to merge in and publish on the site? Since you did this in multiple commits, I wanted to make sure that you're set and not merge too soon.
@geanders I just finished with a final push, so it should be ready to merge and publish now!
@aef1004 Excellent! It's now published: https://kind-neumann-789611.netlify.app/post/exercise-solution-for-chapter-12/
Here is the Chapter 12 exercise. Let me know if I should make any additional changes!