Closed: mallewellyn closed this issue 8 months ago.
Task list:
[x] 1. Line 22/Key points: I can't see where this last key point re computational speed is discussed. Potentially omit or add a brief comment about this in the text?
[x] 2. Line 46/Introduction: The end of this sentence could clarify the differences between the approach of this episode and the previous one. Something like: "that instead [what regularisation does and how it differs from information sharing]".
[x] 3. Line 79/Introduction: I like the example to show what happens when you fit a model using all features. Could add a sentence like "We explain what singularities are and why they appear when fitting high-dimensional data below" to just clarify that this is addressed!
[x] 4. Line 79/Introduction: Also, a brief explanation of why some effect sizes are very high as this doesn't seem to be addressed.
[x] 6. Line 87/Singularities: remove comma before "and why are they..."
[x] 7. Line 141/Correlated features -- common in high-dimensional data: Perhaps this summary of the challenges could be placed outside of the pinned box. I think it deserves more attention as it succinctly summarises both of the above points and motivates the need for regularisation.
[x] 8. Line 141/Correlated features: Also, the fact that p>n is problematic is discussed in a lot of detail (resulting in singularities), and so I think it should be added to this point to justify the use of regularisation (or else the discussion of high-dimensional data being problematic by its very size could be removed and just the discussion of correlations retained)!
Something like: "Regularisation can help us to deal with correlated features." -> "Regularisation can help us to deal with correlated features, as well as effectively reduce the number of features (dimension) in our model, and thus addresses these issues".
[x] 9. Line 152/Challenge 1: "Discuss in groups:" could be "Consider or discuss in groups:" to make suitable for an individual learner.
[x] 10. Lines 175 & 311: Query - it feels, from the section above, as though you are about to describe regularisation, as this is now fully motivated. These regression sections therefore confused me a bit. There's a similar interlude in the previous episode. Could it be possible to cover this as preliminary material at the start of the episode, reference another tutorial on linear regression + training and testing models and reference this throughout all episodes, or use this as a "bridging" episode at the start of the formal set of episodes and back-reference?
Could then provide the information re restricting the model after the example (where we try to fit a linear model on the whole data set) as further motivation and explain how this relates to generalisability.
[x] 11. Line 441/Using regularisation to improve generalisability: "This can be done with regularisation. The idea to add another condition to the problem we’re solving with linear regression." -> "This can be done with regularisation: adding another condition to the problem we’re solving with linear regression."
[x] 12. Line 512/Using regularisation to improve generalisability: "punished" -> "punishes"
[x] 13. Line 515/Using regularisation to improve generalisability: A few brief words about the trade-off here with the extent of regularisation would help to clarify.
[x] 14. Line 516/Why would we want to restrict our model?: "This type of regularisation is called ridge regression" makes it sound as though ridge regression=OLS. Slight re-wording to tie ridge regression to the non-zero penalty may help.
[x] 15. Line 518: Somewhere in this section, could possibly add a title to improve signposting of the ridge method (since there's one for lasso).
[x] 16. Line 520: These problems already feel explained and thus I would propose a minor re-wording to make this a recap and demonstration of the fact that regularisation actually works.
[x] 17. Line 548/Why would we want to restrict our model?: "trend" -> "tend".
[x] 18. Line 742/Cross-validation: The episode hasn't defined what a 'good' value for lambda looks like (which we need in order to say what the 'best' one is).
[x] 19. Line 749/Cross-validation: "Cross-validation is a really deep topic that we're not going to cover in more detail today, though!" possibly a slight re-wording, as the subsequent analysis seems to cover cross-validation in a bit more detail.
[x] 20. Line 756/Cross-validation: "We can use this new idea to choose a lambda value, by" -> "We can use this new idea to choose a lambda value by"
[x] 21. Line 795/Blending ridge regression and the LASSO - elastic nets: Would possibly re-order this title to put elastic nets first if also using lasso and ridge titles (just for signposting).
[x] 22. Line 798/Blending ridge regression and the LASSO - elastic nets: "So far, we've used ridge regression, where `alpha = 0`, and LASSO regression, where `alpha = 1`." -> "So far, we've used ridge regression (where `alpha = 0`) and LASSO regression (where `alpha = 1`)."
[x] 23. Line 1019/Other types of outcomes: It is a little unclear to me that the intercept-only model is selected. Could annotate the code to demonstrate how this happened.
Not sure if mentioned but https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L733
sub "sparsity" here for "penalty" or similar, and rephrase more generally
Going by Carpentries principles, challenge 3 here shouldn't ask learners to do something new. Instead, we should show them how to calculate MSE on the training data by moving the MSE code from here: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L413-L420
to the previous code-along block here: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L341-L356
Then, in the exercise, ask them to do the same things they've already done, but with the test data. Otherwise this challenge will have a low completion rate and lead to cognitive overload/muddled learning outcomes.
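For illustration, a minimal sketch of what the restructured code-along plus exercise might look like (`train_mat`, `train_age`, `test_mat` and `test_age` are placeholder names, not the episode's actual objects):

```r
library("glmnet")

## Code-along step: fit ridge regression on the training data and compute the
## training-set mean squared error (placeholder object names).
fit_ridge  <- glmnet(x = train_mat, y = train_age, alpha = 0)
pred_train <- predict(fit_ridge, newx = train_mat, s = 0.1)  # s = lambda, illustrative value
mse_train  <- mean((train_age - pred_train)^2)

## Exercise step: repeat exactly the same calculation on the held-out test data.
pred_test <- predict(fit_ridge, newx = test_mat, s = 0.1)
mse_test  <- mean((test_age - pred_test)^2)
```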
And this code block could be split up to be explained more thoroughly: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L561-L570
as could this one: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L759-L769
I've proposed some changes in response to both of these comments in the commits above :)
Episode 3
Another nice episode. Although it's quite long, I think it covers what are often quite challenging ideas in a very approachable way. Most of my comments relate to how regularisation is motivated, plus some minor re-wording for clarity. I do, however, have a more challenging query about the placement of the linear regression background information. I have highlighted this in bold below!
Again, I will submit pull requests where possible and very happy to discuss anything.
Line 22/Key points: I can't see where this last key point re computational speed is discussed. Potentially omit or add a brief comment about this in the text?
Line 46/Introduction: The end of this sentence could clarify the differences between the approach of this episode and the previous one. Something like: "that instead [what regularisation does and how it differs from information sharing]".
Line 79/Introduction: I like the example to show what happens when you fit a model using all features. Could add a sentence like "We explain what singularities are and why they appear when fitting high-dimensional data below" to just clarify that this is addressed!
Also, a brief explanation of why some effect sizes are very high as this doesn't seem to be addressed.
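If useful, a tiny simulated example (not the episode's data) could show where the extreme effect sizes come from: when features are nearly collinear, OLS can only estimate their combined effect precisely, so the individual coefficients become unstable and their standard errors enormous.

```r
## Simulated illustration: x2 is almost an exact copy of x1, so OLS struggles
## to separate their individual effects.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.001)   # nearly collinear with x1
y  <- x1 + rnorm(100)               # y really only depends on x1

## Typically gives large, offsetting estimates for x1 and x2,
## with enormous standard errors.
summary(lm(y ~ x1 + x2))$coefficients
```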
Line 84/Singularities: I think it needs to be clearer why high-dimensional data result in singularities (i.e., explicitly saying that the determinant of the matrix is zero for high-dimensional data, so singularities are always present when fitting models naively to high-dimensional data and R often cannot fit the model).
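A short simulated demonstration (again, purely illustrative) could also make the singularity concrete:

```r
## With more features than observations, the design matrix cannot have full
## column rank, so t(X) %*% X is singular and OLS has no unique solution.
set.seed(42)
n <- 10                            # observations
p <- 50                            # features, p > n
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

fit <- lm(y ~ x)
sum(is.na(coef(fit)))              # most coefficients cannot be estimated (NA)
Matrix::rankMatrix(crossprod(x))   # rank is at most n, far below p, so the matrix is singular
```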
Line 141/Correlated features -- common in high-dimensional data: Perhaps this summary of the challenges could be placed outside of the pinned box. I think it deserves more attention as it succinctly summarises both of the above points and motivates the need for regularisation.
Also, the fact that p>n is problematic is discussed in a lot of detail (resulting in singularities), and so I think it should be added to this point to justify the use of regularisation (or else the discussion of high-dimensional data being problematic by its very size could be removed and just the discussion of correlations retained)!
Something like: "Regularisation can help us to deal with correlated features." -> "Regularisation can help us to deal with correlated features, as well as effectively reduce the number of features (dimension) in our model, and thus addresses these issues".
Line 152/Challenge 1: "Discuss in groups:" could be "Consider or discuss in groups:" to make suitable for an individual learner.
Lines 175 & 311: Query - it feels, from the section above, as though you are about to describe regularisation, as this is now fully motivated. These regression sections therefore confused me a bit. There's a similar interlude in the previous episode. Could it be possible to cover this as preliminary material at the start of the episode, reference another tutorial on linear regression + training and testing models and reference this throughout all episodes, or use this as a "bridging" episode at the start of the formal set of episodes and back-reference?
Could then provide the information re restricting the model after the example (where we try to fit a linear model on the whole data set) as further motivation and explain how this relates to generalisability.
Line 515/Using regularisation to improve generalisability: A few brief words about the trade-off here with the extent of regularisation would help to clarify.
Line 516/Why would we want to restrict our model?: "This type of regularisation is called ridge regression" makes it sound as though ridge regression=OLS. Slight re-wording to tie ridge regression to the non-zero penalty may help.
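One way to address both of these points would be to state the ridge objective explicitly, for example (notation assumed; adjust to match the episode's):

$$
\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
$$

Setting $\lambda = 0$ recovers ordinary least squares, while larger values of $\lambda$ shrink the coefficients more strongly, trading a little bias for lower variance.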
Line 518: Somewhere in this section, could possibly add a title to improve signposting of the ridge method (since there's one for lasso).
Line 520: These problems already feel explained and thus I would propose a minor re-wording to make this a recap and demonstration of the fact that regularisation actually works.
Line 742/Cross-validation to find the best value of $\lambda$: The episode hasn't defined what a 'good' value for lambda looks like (which we need in order to say what the 'best' one is).
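For instance, the episode could say that a 'good' $\lambda$ is one that minimises prediction error on held-out data, which is what `cv.glmnet()` estimates. A sketch (placeholder object names):

```r
library("glmnet")

## cv.glmnet() scores each candidate lambda by its cross-validated prediction
## error; the "best" lambda is the one with the lowest estimated error.
cv_fit <- cv.glmnet(x = train_mat, y = train_age, alpha = 0)
cv_fit$lambda.min   # lambda with the smallest mean cross-validated error
cv_fit$lambda.1se   # largest lambda within one standard error of that minimum
plot(cv_fit)        # cross-validated error as a function of log(lambda)
```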
Line 749/Cross-validation to find the best value of $\lambda$: "Cross-validation is a really deep topic that we're not going to cover in more detail today, though!" possibly a slight re-wording, as the subsequent analysis seems to cover cross-validation in a bit more detail.
Line 795/Blending ridge regression and the LASSO - elastic nets: Would possibly re-order this title to put elastic nets first if also using lasso and ridge titles (just for signposting).
Line 1019/Other types of outcomes: It is a little unclear to me that the intercept-only model is selected. Could annotate the code to demonstrate how this happened.
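For example, a couple of annotated lines after the fit could make this explicit (`fit_cv` is a placeholder for whatever cross-validated fit the episode produces at that point):

```r
## Inspect the coefficients at the selected lambda: in the intercept-only case
## described in the episode, every feature's coefficient is shrunk exactly to
## zero, so only the intercept remains.
coefs <- coef(fit_cv, s = "lambda.min")
sum(coefs != 0)                          # 1: a single non-zero coefficient...
rownames(coefs)[as.vector(coefs != 0)]   # ...and it is "(Intercept)"
```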
Minor comments
Line 87/Singularities: remove comma before "and why are they..."
Line 441/Using regularisation to improve generalisability: "This can be done with regularisation. The idea to add another condition to the problem we’re solving with linear regression." -> "This can be done with regularisation: adding another condition to the problem we’re solving with linear regression."
Line 512/Using regularisation to improve generalisability: "punished" -> "punishes"
Line 548/Why would we want to restrict our model?: "trend" -> "tend".
Line 756/Cross-validation to find the best value of $\lambda$: "We can use this new idea to choose a lambda value, by" -> "We can use this new idea to choose a lambda value by"
Line 798/Blending ridge regression and the LASSO - elastic nets: "So far, we've used ridge regression, where `alpha = 0`, and LASSO regression, where `alpha = 1`." -> "So far, we've used ridge regression (where `alpha = 0`) and LASSO regression (where `alpha = 1`)."
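As an aside, the role of `alpha` as a dial between the two penalties could be reinforced with a short comparison (placeholder object names again):

```r
library("glmnet")

## alpha blends the two penalties: 0 is pure ridge, 1 is pure LASSO,
## and values in between give an elastic net.
fit_ridge   <- glmnet(x = train_mat, y = train_age, alpha = 0)
fit_lasso   <- glmnet(x = train_mat, y = train_age, alpha = 1)
fit_elastic <- glmnet(x = train_mat, y = train_age, alpha = 0.5)
```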