carpentries-incubator / high-dimensional-stats-r

High-dimensional statistics with R
https://carpentries-incubator.github.io/high-dimensional-stats-r

Review comments: Episode 3 - regularised regression #115

Closed: mallewellyn closed this issue 8 months ago

mallewellyn commented 8 months ago

Episode 3

Another nice episode. Although it's quite long, I think it covers what are often quite challenging ideas in a very approachable way. Most of the comments I have relate to how regularisation is motivated and some minor re-wording to clarify. I do, however, have a more challenging query about the placement of the linear regression background information. I have highlighted this in bold below!

Again, I will submit pull requests where possible and am very happy to discuss anything.

Also, it would help to add a brief explanation of why some effect sizes are very high, as this doesn't seem to be addressed.

Also, the fact that p > n is problematic (it results in singularities) is discussed in a lot of detail, so I think it should be added to this point to justify the use of regularisation. Alternatively, the discussion of high-dimensional data being problematic by its very size could be removed and just the discussion of correlations retained!

Something like: "Regularisation can help us to deal with correlated features." -> "Regularisation can help us to deal with correlated features, as well as effectively reduce the number of features (dimension) in our model, and thus addresses these issues".

Could then provide the information re restricting the model after the example (where we try to fit a linear model on the whole data set) as further motivation and explain how this relates to generalisability.
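
To make the p > n point concrete, here is a brief sketch of the standard algebra. The notation is assumed here rather than taken from the episode: X is the n × p feature matrix, y the outcome vector, and λ the penalty strength. A ridge (L2) penalty is used purely as an illustration.

```latex
% Ordinary least squares needs the inverse of the p x p matrix X^T X:
\[
  \hat{\beta}_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y
\]

% The rank of X^T X is bounded by the smaller dimension of X:
\[
  \operatorname{rank}(X^\top X) = \operatorname{rank}(X) \le \min(n, p)
\]

% So when p > n the rank is below p, X^T X is singular,
% and the OLS estimate is not unique.

% A ridge penalty replaces X^T X with X^T X + \lambda I_p:
\[
  \hat{\beta}_{\mathrm{ridge}} = (X^\top X + \lambda I_p)^{-1} X^\top y,
  \qquad \lambda > 0
\]
```

Because XᵀX is positive semi-definite, adding λI with λ > 0 makes the matrix positive definite and therefore invertible, even when p > n. Note that a ridge penalty only shrinks coefficients; it is a LASSO (L1) penalty that can set coefficients exactly to zero, which is the sense in which regularisation can "reduce the number of features".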

Minor comments

mallewellyn commented 8 months ago

Task list:

Something like: "Regularisation can help us to deal with correlated features." -> "Regularisation can help us to deal with correlated features, as well as effectively reduce the number of features (dimension) in our model, and thus addresses these issues".

Could then provide the information re restricting the model after the example (where we try to fit a linear model on the whole data set) as further motivation and explain how this relates to generalisability.

alanocallaghan commented 8 months ago

Not sure if this has been mentioned, but at https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L733

substitute "sparsity" here for "penalty" or similar, and rephrase more generally.

alanocallaghan commented 8 months ago

Going by Carpentries principles, challenge 3 here:

https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L358-L404

shouldn't ask learners to do something new. Instead, we should show them how to calculate the MSE on the training data by moving the MSE code from here: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L413-L420

to the previous code-along block here: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L341-L356

Then, in the exercise, we can ask them to repeat the steps they've already done, but with the test data. Otherwise, this challenge will have a low completion rate and lead to cognitive overload and muddled learning outcomes.
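
As a rough illustration of the split suggested above, the sketch below computes the MSE on training data first (the code-along step) and then repeats the identical calculation on test data (the exercise step). It is not the episode's code: the simulated data, the `mse()` helper, and all variable names are hypothetical, and only base R is assumed.

```r
# Hypothetical simulated data; NOT the dataset used in the episode.
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Split the rows into a training half and a test half.
train_idx <- sample(n, size = 50)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Illustrative helper: mean squared error between observed and predicted values.
mse <- function(observed, predicted) {
  mean((observed - predicted)^2)
}

# Fit an ordinary linear model on the training data only.
fit <- lm(y ~ x, data = train)

# Code-along step: error on the data the model was fitted to.
mse_train <- mse(train$y, predict(fit, newdata = train))

# Exercise step: the same calculation, but on held-out test data.
mse_test <- mse(test$y, predict(fit, newdata = test))

print(c(train = mse_train, test = mse_test))
```

With this layout, the only change learners make in the exercise is swapping `train` for `test`, so the challenge reinforces a step they have already seen rather than introducing a new one.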

alanocallaghan commented 8 months ago

And this code block could be split up to be explained more thoroughly: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L561-L570

as could this one: https://github.com/carpentries-incubator/high-dimensional-stats-r/blob/57f2f5b2d6dbb9c8f190542e34da2ba0979acd73/_episodes_rmd/03-regression-regularisation.Rmd#L759-L769

mallewellyn commented 8 months ago

I've proposed some changes in response to both of these comments in the commits above :)