ailithewing commented 2 years ago

A list of proposed changes following the May delivery of HDS

These are in addition to the changes in the pull request ailith_delivery3 and to the changes that Hannes made that have yet to be pushed to the main course materials.

Throughout

[x] bold package names and include () for functions

Intro

[x] Change high-dimensional data definition
[x] Switch out prostate dataset or make it much clearer that it's a toy dataset for the purposes of explanation
[x] Change view to head and dim
[x] Expand challenge 1 solution
[x] More specific question than examine the dataset in challenge 2 (from Emma's review in #39)
[x] Check how we're referring to figures, e.g. not by number if there's no number
[x] Could add a challenge question to show what happens with correlated variables (see Emma's review in #39)
[x] Take out Bioconductor intro as we never teach it (maybe condense and put in a callout box?)
[x] Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
[x] Explain why you are using here? (from Emma's review in #39)
[x] STRUCTURAL Challenges section should focus on two things: (a) ill-defined models (more predictors than observations), where we can add a figure with one dot only, and (b) correlated predictors, where we could add code to show unstable coefficient estimates.
[x] STRUCTURAL Rewrite section on which statistical methods are used to give an overview of the course. Focus on problems and what analysis is used when (exploring one outcome with many similar features (methylation/expression) / predicting outcomes with more features than observations / reducing dimensionality/grouping/making sense of similar predictors / clustering observations)
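
For reference, the two problems in the Challenges item could be demonstrated along these lines. This is only a rough sketch on simulated data, not code from the episode: the object names, sample sizes and the helper `simulate_slope()` are all made up for illustration.

```r
# Illustrative simulated data only; nothing here comes from the course materials.
set.seed(42)

# (a) Ill-defined model: 5 observations but 10 predictors.
n <- 5
p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
fit_wide <- lm(y ~ x)
# Only as many coefficients as observations can be estimated; lm() reports NA for the rest.
n_estimated <- sum(!is.na(coef(fit_wide)))
n_undefined <- sum(is.na(coef(fit_wide)))

# (b) Correlated predictors: refit on fresh data many times and track one slope.
# simulate_slope() is a hypothetical helper; rho is the correlation between x1 and x2.
simulate_slope <- function(rho, n_obs = 50) {
  x1 <- rnorm(n_obs)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n_obs)
  y <- x1 + x2 + rnorm(n_obs)
  coef(lm(y ~ x1 + x2))[["x1"]]
}
slopes_correlated <- replicate(200, simulate_slope(rho = 0.999))
slopes_independent <- replicate(200, simulate_slope(rho = 0))

# The spread of the estimated slope is much larger when the predictors are correlated.
sd(slopes_correlated)
sd(slopes_independent)
```

The true slope is 1 in both settings, so the difference in spread comes from the correlation between predictors rather than from the signal itself.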

Regression with many features (many outcomes)

[x] rank results in toptable by effect size
[x] include small intro to feature selection to motivate why these techniques are useful as we took the feature selection lesson out of the 2-day course.
[x] check exercises aren't introducing new concepts
[x] check direction of smoker is consistent between model and plot
[x] Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
[x] Explore whether the episode can be made shorter or divided (from Emma's review in #47)
[x] Add a reference for the source of the methylation data
[x] Change title to regression with many outcomes and add a brief comment to distinguish between dealing with many outcomes and/or many features (we can mention that the regularisation episode will address that). Potentially, we can create a separate episode Regression in high-dimensional settings where we introduce the methylation data and the two different types of problems. However, this is outside the scope for this round of changes. Creating this separate episode would also address some of Emma's concerns.
[x] Add mention of dream() from variancePartition, which is similar to limma but can handle grouping (random effects)

Regularisation

[x] needs split up
- motivation & rationale - in expanded intro
- intro to model selection/cross validation
- what is regularisation in general?
- ridge and lasso
[x] more explanation of Horvath
[ ] greater figure explanation in the materials
[x] fix overuse of Xi
[ ] more detail on extracting coefficients and model interpretation
[ ] glossary of jargon
[x] add link to ML course for related materials (from #7)

CAV (20220206) Link added to episode 1 instead as it's general across different types of ML approaches.
[x] review plot labels (from #7)

CAV (20220206) I can't recall what the specific issue was, but the episode has been extensively revised and labels look ok.
[x] review phrasing in "why would we...?" - Alan marked it as convoluted (from #7)

CAV (20220206) Paragraph was revised, so hopefully OK now.
[x] review ridge/EN equations (partially from #7)

CAV (20220206) Notation review.
[x] in exercise 2, maybe ask why mean squared rather than sum of squared (from #7)
[x] Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
[x] move up the section "Using regularisation to improve generalisability"
[x] add reason for training and test intro, like: "Before we move on to regularised regression, we have to introduce..."
[x] when talking about elastic net, say we've used it all along - ridge and lasso are special cases with alpha=0 and alpha=1, respectively
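
To make the elastic net point concrete, here is a small sketch of the penalty term, assuming the parameterisation used by glmnet (a mixing parameter alpha and an overall strength lambda). The function `elastic_net_penalty()` and the example coefficients are hypothetical and not part of any package or of the episode.

```r
# Hypothetical helper: elastic net penalty for a coefficient vector,
# assuming the glmnet-style form lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(|beta|)).
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -2, 0)  # made-up coefficients

ridge_penalty <- elastic_net_penalty(beta, lambda = 1, alpha = 0)  # only the squared (L2) term remains
lasso_penalty <- elastic_net_penalty(beta, lambda = 1, alpha = 1)  # only the absolute (L1) term remains
mixed_penalty <- elastic_net_penalty(beta, lambda = 1, alpha = 0.5)  # a blend of both
```

Setting alpha to 0 or 1 drops one of the two terms, which is why ridge and lasso can be presented as the two ends of the same model.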

PCA

[x] consider removing scaling from gene expression pca (include box about gene expression normalisation to emphasise that that's not what we're talking about)
[x] Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
[x] Is the equation halfway down needed at all (which refers to the original example)?
[x] add note that PCAtools takes data in the Bioconductor orientation
[x] STRUCTURAL add table comparing terms for loadings and scores used in different packages

FA

[x] move advantages and disadvantages of FA up so it's in the introduction
[x] more detail on communality and uniqueness
[x] mention confirmatory factor analysis
[x] discuss ways of determining number of factors
[x] Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)

K means

[x] fix border=NA in sil plot
[x] check the coloured blocks on bootstrapping the clusters (check set.seed)
[x] include this for silhouette scores https://medium.com/@cmukesh8688/silhouette-analysis-in-k-means-clustering-cefa9a7ad111
[x] exercise 1 bugged (from #7; unsure if still needed)
[x] initial mcq before callout? (from #7; unsure if still needed)
[x] title for 1st practical bit (from #7; unsure if still needed)
[x] formal description of silhouette width (from #7; unsure if still needed)
[x] k of 5 or k=5, not both (from #7; unsure if still needed)
[x] title for introducing bootstrap (from #7; unsure if still needed)
[x] title for applying bootstrap (from #7; unsure if still needed)
[x] more detail on bootstrap (from #7; unsure if still needed)
[x] Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)

Hierarchical clusters

[x] Check there's not a confusing switch between clustering features and clustering observations
[x] Add brackets for function names in text, e.g. pairs() (from Emma's review in #39)
[x] improve exercise on linkage methods, perhaps add examples? See https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec
[x] STRUCTURAL add material on when to use which linkage method

Other

[x] Consider temporarily removing optional episodes until reviewed/edited.
[ ] Edit setup.md to indicate approx time based on RStudio cloud (~30 mins) (from #34)
[ ] Check whether the list in dependencies.csv can be reduced (see #34)
[ ] Test setup.md in different environments. (see #34)
[ ] Create a Docker image based on setup.md?

ailithewing commented 2 years ago

@catavallejos @nathansam @hwarden162 @Alanocallaghan Please add any additional things that I've missed.

nathansam commented 2 years ago

kmeans: set a seed for the heatmap code chunk starting library("pheatmap") (which might be covered by the coloured blocks to-do item)
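
A quick sketch of why the seed matters, using base R kmeans() on simulated data rather than the actual pheatmap chunk. The data and the helper `cluster_with_seed()` are made up for illustration.

```r
# Simulated data for illustration; not the data used in the episode.
set.seed(1)
dat <- matrix(rnorm(200), ncol = 2)

# Hypothetical helper: fix the seed immediately before clustering so that
# the random starting centres, and hence the cluster labels, are reproducible.
cluster_with_seed <- function(seed) {
  set.seed(seed)
  kmeans(dat, centers = 3)$cluster
}

# Same seed, same cluster assignments on every run.
same_result <- identical(cluster_with_seed(10), cluster_with_seed(10))
same_result
```

Without the set.seed() call inside the chunk, the labels (and so the coloured blocks) can change between renders of the lesson.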

hannesbecher commented 2 years ago

Challenge 1 in episode 1. Not sure about question 4. Is this a good example of high-dim data? Because it is one observation and so many features?

Predicting probability of a patient's cancer progressing using gene expression data from 20,000 genes, as well as data associated with general patient health (age, weight, BMI, blood pressure) and cancer growth (tumour size, localised spread, blood test results).

alanocallaghan commented 2 years ago

Changing that challenge from singular to plural patients would also be good to avoid implying high precision from generic prediction models (i.e. precision medicine hype)

hannesbecher commented 2 years ago

Current uniqueness/communality explanations contradict Wikipedia I think: https://en.wikipedia.org/wiki/Factor_analysis#Terminology

alanocallaghan commented 2 years ago

One way of reducing the number of dependency packages is to move all the data wrangling stuff to a data package and then just install it with remotes::install_github().

hannesbecher commented 1 year ago

Glossary still open, but covered by issue #89

carpentries-incubator / high-dimensional-stats-r

Third delivery suggested changes #64