I don't know if these are rhetorical, but
Why `as.data.frame`? Comparing the simpler `fit_horvath <- lm(train_age ~ train_mat)` to the example `fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))`
The second example preserves the variable names as-is, so when you use `predict()` with `newdata` it doesn't throw a warning. It should probably work with a data frame from the start there.
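To illustrate (a minimal sketch with made-up toy data; `train_mat` and `train_age` stand in for the lesson's objects):

```r
set.seed(1)
# Toy stand-ins for the lesson's training matrix and ages
train_mat <- matrix(rnorm(100), nrow = 20,
                    dimnames = list(NULL, paste0("cpg", 1:5)))
train_age <- rnorm(20, mean = 40, sd = 10)

# Matrix on the right-hand side: fits, but the coefficients are named
# "train_matcpg1" etc., so predict() with newdata complains
fit_matrix <- lm(train_age ~ train_mat)

# Data frame with ".": keeps the column names, so predict() with
# newdata matches the variables cleanly
train_df <- as.data.frame(train_mat)
fit_df <- lm(train_age ~ ., data = train_df)

new_df <- as.data.frame(matrix(rnorm(25), nrow = 5,
                               dimnames = list(NULL, paste0("cpg", 1:5))))
predict(fit_df, newdata = new_df)  # no warning
```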
What does the `-1` do to the `methyl_mat` matrix in k-fold cross-validation (in `lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)`)?
I'm not 100% sure, but presumably this removes the intercept column, as `glmnet` adds an intercept automatically. Again, it would probably be better to set the data up so the code is similar across the `lm` and `glmnet` calls, although I think that's actually rather difficult.
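A sketch of the `glmnet` side, assuming (as guessed above) that the first column of `methyl_mat` is an intercept column of ones; the toy data here is made up:

```r
library(glmnet)
set.seed(1)
# Toy stand-in: a column of ones plus methylation features
methyl_mat <- cbind(intercept = 1,
                    matrix(rnorm(400), nrow = 40,
                           dimnames = list(NULL, paste0("cpg", 1:10))))
age <- rnorm(40, mean = 40, sd = 10)

# glmnet takes a plain numeric matrix (no formula interface) and fits
# its own intercept, so [, -1] drops the redundant column of ones
lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)
plot(lasso)  # cross-validated error across the lambda path
```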
@Alanocallaghan thanks, it wasn't rhetorical and sorry for being unclear. I agree that it would be helpful to either set up the code to be more similar, or to explain the details.
The first is explained more fully in this issue: https://github.com/carpentries-incubator/high-dimensional-stats-r/issues/52
Many of these are now implemented. Others have become obsolete due to restructuring.
DRAFT TO BE UPDATED AFTER DAY 4 - saved here to get started, currently updated to day 3.
EdCarp delivery 2022-09-27 to 2022-09-30, with instructors @hannesbecher, @luciewoellenstein44, @ewallace. https://edcarp.github.io/2022-09-27_ed-dash_high-dim-stats/
Collaborative document: https://pad.carpentries.org/2022-09-27_ed-dash_high-dim-stats
Overall went very well, good material, happy and engaged students.
Day 1 - Introduction, Regression with many features
Learner feedback
Please list 1 thing that you liked or found particularly useful
Please list another thing that you found less useful, or that could be improved
Instructor feedback
Day 2 - Regularised regression
Learner feedback
Please list 1 thing that you liked or found particularly useful
Please list another thing that you found less useful, or that could be improved
Instructor feedback
Learners had several questions about extra arguments in calls to `lm()`, `glmnet()`, and so on. See the etherpad for day 2. Those should give clues to places to simplify:
- Why `as.data.frame`? Comparing the simpler `fit_horvath <- lm(train_age ~ train_mat)` to the example `fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat))`.
- What does the `-1` do in `lasso <- cv.glmnet(methyl_mat[, -1], age, alpha = 1)`?
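One possible simplification along the lines suggested above (a sketch with hypothetical toy data, not lesson code): build a data frame once and derive the matrix from it, so the two calls read similarly.

```r
library(glmnet)
set.seed(1)
# Hypothetical toy data standing in for the lesson's methylation data
methyl_df <- as.data.frame(matrix(rnorm(400), nrow = 40,
                                  dimnames = list(NULL, paste0("cpg", 1:10))))
age <- rnorm(40, mean = 40, sd = 10)

# The same object feeds both interfaces
fit_lm    <- lm(age ~ ., data = methyl_df)
fit_lasso <- cv.glmnet(as.matrix(methyl_df), age, alpha = 1)
```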
Day 3 - Principal component analyses, Factor analysis
Learner feedback
Please list 1 thing that you liked or found particularly useful
Please list another thing that you found less useful, or that could be improved
Instructor feedback
PCA (Episode 4)
Really nice introductory explanations.
Episode 4 PCA, Challenge 1, example 2 is ambiguous as it could be interpreted as PCA-appropriate. Could that be clarified or discussed?
For Challenge 2, some of the students said it "seems like a trick question".
Loadings are introduced approximately three times, but only explained later in the lesson. Could that be rationalised so they are introduced strongly once? Understanding the loadings helps you understand how the PCs are calculated, and that could come before deciding how many PCs to keep (see the sketch below).
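For instance (a sketch on a built-in dataset, not lesson code), `prcomp`'s `$rotation` holds the loadings, and the PCs are just the centred and scaled data multiplied by them, which is why the loadings could come first:

```r
pc <- prcomp(USArrests, scale. = TRUE)
pc$rotation  # loadings: one column of variable weights per PC

# The scores (PCs) are the standardised data times the loadings
scores <- scale(USArrests) %*% pc$rotation
all.equal(unname(scores), unname(pc$x))  # TRUE
```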
The difference between the plot styles (base plot earlier, ggplot2-based later) is striking and perhaps distracting. For example, one biplot looks very different from another biplot. This could also make the code fragile for learners, as in the same lesson `biplot` is used for both `PCAtools::biplot` and `stats::biplot` (see the sketch below).
Are the labels in the biplot needed in the PCAtools/microarray example? They seem like unnecessary and distracting information here, given we are not going to explain `GSMxxxxx` or `211122_s_at`. They are also hard to read: too small and/or overlapping, and they give ggrepel error messages.
This lesson introduced me to the terms "screeplot" and "biplot", as I didn't have special names for them before. Maybe an extra sentence of explanation for each would be helpful.
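One way to make the name collision explicit for learners (a sketch, assuming PCAtools is installed; `USArrests` stands in for the lesson's microarray data) is to namespace-qualify the two calls, since they expect different object types:

```r
# stats::biplot works on a prcomp object
pr <- prcomp(USArrests, scale. = TRUE)
stats::biplot(pr)

# PCAtools::biplot expects a PCAtools pca object (features in rows)
library(PCAtools)
pc <- pca(t(USArrests))
PCAtools::biplot(pc)
```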
"Remove the lower 20% of PCs with lower variance" was unclear to learners.
In some code snippets, comments placed after the code appear after the output instead of next to the code they refer to. It might be more helpful to move the comments immediately before the line of code they refer to, as in the example below.
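An illustrative example only:

```r
x <- rnorm(10)

# Clearer: the comment sits immediately before the call it describes
summary(x)

summary(x)
# Less clear: in the rendered lesson this comment appears after the output
```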
`plotloadings` was unclear to instructors and to learners. We wondered how the included variables are chosen, and whether it is important to include it. Reading `?plotloadings`, it says that the `rangeRetain` argument gives a "Cut-off value for retaining variables" in terms of the "top/bottom fraction of the loadings range". I (Edward) find that unintuitive. For example, there are still many points in 1/10000th of the loadings range: `plotloadings(pc, labSize = 3, rangeRetain = 1e-5)`.
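To make the discussion concrete, a sketch with hypothetical random data (the real lesson uses microarray data):

```r
library(PCAtools)
set.seed(1)
# Hypothetical data: 1000 features (rows) by 20 samples (columns)
mat <- matrix(rnorm(20000), nrow = 1000,
              dimnames = list(paste0("feature", 1:1000),
                              paste0("sample", 1:20)))
pc <- pca(mat)

# Per ?plotloadings, rangeRetain is the top/bottom fraction of the
# loadings *range* to keep; it is not a count or a quantile
plotloadings(pc, labSize = 3, rangeRetain = 0.01)
plotloadings(pc, labSize = 3, rangeRetain = 1e-5)
```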
Factor analysis (Episode 5)
Day 4 - K-means clustering, Hierarchical clustering
Learner feedback
Instructor feedback