I don't know if these are rhetorical, but
Why `as.data.frame`? Comparing the simpler `fit_horvath <- lm(train_age ~ train_mat)` to the example `fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))`
The second example preserves the variable names as-is, so when you use `predict()` with `newdata` it doesn't throw a warning. It should probably work with a data frame from the start there.
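To illustrate (a minimal sketch with made-up toy data; `train_mat` and `train_age` stand in for the lesson's objects):

```r
set.seed(1)
# Toy stand-ins for the lesson's training matrix and ages
train_mat <- matrix(rnorm(100), nrow = 20,
                    dimnames = list(NULL, paste0("cpg", 1:5)))
train_age <- rnorm(20, mean = 40, sd = 10)

# Matrix on the right-hand side: fits, but the coefficients are named
# "train_matcpg1" etc., so predict() with newdata complains
fit_matrix <- lm(train_age ~ train_mat)

# Data frame with ".": keeps the column names, so predict() with
# newdata matches the variables cleanly
train_df <- as.data.frame(train_mat)
fit_df <- lm(train_age ~ ., data = train_df)

new_df <- as.data.frame(matrix(rnorm(25), nrow = 5,
                               dimnames = list(NULL, paste0("cpg", 1:5))))
predict(fit_df, newdata = new_df)  # no warning
```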
What does the `-1` do to the `methyl_mat` matrix in k-fold cross-validation (in `lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)`)?
I'm not 100% sure, but presumably this removes the intercept column, as `glmnet` adds an intercept automatically. Again, it would probably be better to set the data up so the code is similar across the `lm` and `glmnet` calls, although I think that's actually rather difficult.
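A sketch of the `glmnet` side, assuming (as guessed above) that the first column of `methyl_mat` is an intercept column of ones; the toy data here is made up:

```r
library(glmnet)
set.seed(1)
# Toy stand-in: a column of ones plus methylation features
methyl_mat <- cbind(intercept = 1,
                    matrix(rnorm(400), nrow = 40,
                           dimnames = list(NULL, paste0("cpg", 1:10))))
age <- rnorm(40, mean = 40, sd = 10)

# glmnet takes a plain numeric matrix (no formula interface) and fits
# its own intercept, so [, -1] drops the redundant column of ones
lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)
plot(lasso)  # cross-validated error across the lambda path
```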
@Alanocallaghan thanks, it wasn't rhetorical and sorry for being unclear. I agree that it would be helpful to either set up the code to be more similar, or to explain the details.
The first is explained more fully in this issue: https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/52
Many of these are now implemented. Others have become obsolete due to restructuring.
DRAFT TO BE UPDATED AFTER DAY 4 - saved here to get started, currently updated to day 3.
EdCarp delivery 2022-09-27 to 2022-09-30, with instructors @hannesbecher, @luciewoellenstein44, @ewallace. https://edcarp.github.io/2022-09-27_ed-dash_high-dim-stats/
Collaborative document: https://pad.carpentries.org/2022-09-27_ed-dash_high-dim-stats
Overall went very well, good material, happy and engaged students.
Day 1 - Introduction, Regression with many features
Learner feedback
Please list 1 thing that you liked or found particularly useful
Please list another thing that you found less useful, or that could be improved
Instructor feedback
Day 2 - Regularised regression
Learner feedback
Please list 1 thing that you liked or found particularly useful
Please list another thing that you found less useful, or that could be improved
Instructor feedback
Learners had several questions about extra arguments in calls to `lm()`, `glmnet()`, and so on. See the etherpad for day 2. Those should give clues to places to simplify:
- Why `as.data.frame`? Comparing the simpler `fit_horvath <- lm(train_age ~ train_mat)` to the example `fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))`.
- What does the `-1` do in `lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)`?
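One possible simplification along the lines suggested above (a sketch with hypothetical toy data, not lesson code): build a data frame once and derive the matrix from it, so the two calls read similarly.

```r
library(glmnet)
set.seed(1)
# Hypothetical toy data standing in for the lesson's methylation data
methyl_df <- as.data.frame(matrix(rnorm(400), nrow = 40,
                                  dimnames = list(NULL, paste0("cpg", 1:10))))
age <- rnorm(40, mean = 40, sd = 10)

# The same object feeds both interfaces
fit_lm    <- lm(age ~ ., data = methyl_df)
fit_lasso <- cv.glmnet(as.matrix(methyl_df), age, alpha = 1)
```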
Day 3 - Principal component analyses, Factor analysis
Learner feedback
Please list 1 thing that you liked or found particularly useful
Please list another thing that you found less useful, or that could be improved
Instructor feedback
PCA (Episode 4)
Really nice introductory explanations.
Episode 4 PCA, Challenge 1, example 2 is ambiguous as it could be interpreted as PCA-appropriate. Could that be clarified or discussed?
For Challenge 2, some of the students said it "seems like a trick question".
Loadings are introduced approximately three times, but only explained later in the lesson. Could that be rationalised so they are introduced strongly once? Understanding the loadings helps you understand how the PCs are calculated, and that could come before deciding how many PCs to keep (see the sketch below).
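For instance (a sketch on a built-in dataset, not lesson code), `prcomp`'s `$rotation` holds the loadings, and the PCs are just the centred and scaled data multiplied by them, which is why the loadings could come first:

```r
pc <- prcomp(USArrests, scale. = TRUE)
pc$rotation  # loadings: one column of variable weights per PC

# The scores (PCs) are the standardised data times the loadings
scores <- scale(USArrests) %*% pc$rotation
all.equal(unname(scores), unname(pc$x))  # TRUE
```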
The difference between the plot styles (base plot earlier, ggplot2-based later) is striking and perhaps distracting. For example, one biplot looks very different from another biplot. This could also make the code fragile for learners, as in the same lesson `biplot` is used for both `PCAtools::biplot` and `stats::biplot` (see the sketch below).
Are the labels in the biplot needed in the PCAtools/microarray example? They seem like unnecessary and distracting information here, given we are not going to explain `GSMxxxxx` or `211122_s_at`. They are also hard to read: too small and/or overlapping, and they give ggrepel error messages.
This lesson introduced me to the terms "screeplot" and "biplot", as I didn't have special names for them before. Maybe an extra sentence of explanation for each would be helpful.
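One way to make the name collision explicit for learners (a sketch, assuming PCAtools is installed; `USArrests` stands in for the lesson's microarray data) is to namespace-qualify the two calls, since they expect different object types:

```r
# stats::biplot works on a prcomp object
pr <- prcomp(USArrests, scale. = TRUE)
stats::biplot(pr)

# PCAtools::biplot expects a PCAtools pca object (features in rows)
library(PCAtools)
pc <- pca(t(USArrests))
PCAtools::biplot(pc)
```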
"Remove the lower 20% of PCs with lower variance" was unclear to learners.
In some code snippets, comments placed after the code appear after the output instead of next to the code they refer to. It might be more helpful to move the comments immediately before the line of code they refer to, as in the example below.
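An illustrative example only:

```r
x <- rnorm(10)

# Clearer: the comment sits immediately before the call it describes
summary(x)

summary(x)
# Less clear: in the rendered lesson this comment appears after the output
```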
`plotloadings` was unclear to instructors and to learners. We wondered how the included variables are chosen, and whether it is important to include it. Reading `?plotloadings`, it says that the `rangeRetain` argument gives a "Cut-off value for retaining variables" in terms of the "top/bottom fraction of the loadings range". I (Edward) find that unintuitive. For example, there are still many points in 1/10000th of the loadings range: `plotloadings(pc, labSize = 3, rangeRetain = 1e-5)`.
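To make the discussion concrete, a sketch with hypothetical random data (the real lesson uses microarray data):

```r
library(PCAtools)
set.seed(1)
# Hypothetical data: 1000 features (rows) by 20 samples (columns)
mat <- matrix(rnorm(20000), nrow = 1000,
              dimnames = list(paste0("feature", 1:1000),
                              paste0("sample", 1:20)))
pc <- pca(mat)

# Per ?plotloadings, rangeRetain is the top/bottom fraction of the
# loadings *range* to keep; it is not a count or a quantile
plotloadings(pc, labSize = 3, rangeRetain = 0.01)
plotloadings(pc, labSize = 3, rangeRetain = 1e-5)
```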
Factor analysis (Episode 5)
Day 4 - K-means clustering, Hierarchical clustering
Learner feedback
Instructor feedback