Review comments: Episode 1 - introduction to high-dimensional data

mallewellyn commented 9 months ago

Overall, I really like this as an introduction to the course. I think it strikes a good balance between motivating the lesson clearly while avoiding overwhelming a learner with information. I have listed some comments below. For the most part, I think these can be addressed by re-ordering sentences and adding additional signposting. I've also proposed some minor changes at the very bottom.

I will submit pull requests for these changes but feel free to reject changes where appropriate, of course!

Line 53/What are high-dimensional data?: Sentence "Such data sets pose a challenge for data analysis as standard methods of analysis, such as linear regression, are no longer appropriate." Possibly a brief few words to foreshadow that this is discussed later or why this is an issue would help to motivate the lesson more clearly.
Lines 41-66/What are high-dimensional data?: I think these paragraphs are very informative but perhaps a little cyclical in places since each paragraph starts with a more specific use case of the above and ends with the challenges. It may be clearer (not definitely!) if the applications are first described (getting more descriptive as is) and then the challenges are described to keep the flow of ideas. The challenges could even come after the initial plot of the data.
Line 107/Challenges in dealing with high-dimensional data: I like this as a heading, but given we first see "Challenge" with reference to problems for the learners to try directly above, the use of "Challenge" again in this title initially confused me a little. Maybe a synonym for challenge would help in this title.
Line 112/Challenges in dealing with high-dimensional data: Sentences in the paragraph from "This is because, ...". I think these sentences nicely explain why high-dimensional data are so prevalent, but as such I think it should be moved to the section above "What are high-dimensional data". It feels as though the first sentence is starting to explain why methods for high-dimensional data analysis are challenging (don't have as many tools historically) which feels relevant to the section, but the next parts go on to explain why this exists. If moving the explanation for the existence of high-dimensional data, the paragraph could also then pin down what the overall challenge is with high-dimensional data (even if this is a list of the things described later in the lesson like less developed or impossible visualisation) just to clearly motivate what's to come.
Line 140/Challenge 2: Given that this is the first bit of R code in the episode, a comment could be added to make sure people have completed the setup instructions. I think this is probably particularly useful for independent learners. We could also consider extending this to every first R challenge of each episode (I'm thinking that this would be consistent with the Carpentries' consideration for independent learners).

Second, is there any way we can use a high-dimensional data set here? It feels confusing to talk about high-dimensional data and its challenges and then illustrate those challenges (as stated in the text above) with a data set that is not high-dimensional. I understand the point completely after looking at the problem and solution, but it's not immediately clear to me.

Line 144/Challenge 2: Possibly a new line for each part a, b and c of this challenge. Should part c also end with a question like "What problem with high-dimensional data analysis does this illustrate?" to get people thinking?
Line 188: I think the example used in this section is really clear. However, I'm a little confused if this section is related to the challenge we discussed above re visualising lots of variables being difficult, or if it's a distinct issue. Some text here either differentiating the issues or linking may be useful. If differentiating between the two, titles/subtitles/paragraph titles could also help signpost and make clear.
Line 210: This really feels like the biggest problem and is explained below, but maybe more link between the challenge and first text here would help. Something like "Let's explore why high correlations might be an issue in a Challenge".
Line 287/What statistical methods are used to analyse high-dimensional data?: I'm confused how the challenges described in this sentence relate to what we've just discussed. Further, some list the cause of the challenge (e.g. high correlation), while some list the effect (over-fitting). Maybe just summarising the causes first and then effect just to reinforce what was just discussed: difficult to visualise leading to challenges identifying suitable response variables, more features than observations leading to over-fitting, correlations between variables causing challenges including over-fitting (I don't think this latter point is actually explained above but could be useful to explain there. Also there may be more challenges with correlation between variables, hence the wording with that one). Having some sort of change re Line 188 could also help clarify.

Also - "can be difficult due to..." rather than "is difficult due to: " may be more accurate since there are other issues we've not considered yet. Also, not sure about this, but is the use of the colon correct in the latter phrase?

Some minor comments:

Line 53/What are high-dimensional data?: LaTeX formatting of the inequality of "$p$>=$n$" (-> "$p>=n$")
Line 57/What are high-dimensional data?: "Subjects like genomics and medical sciences often use both tall (in terms of $n$) and wide (in terms of $p$) datasets that can be difficult to analyse or visualise using standard statistical tools." Is it more precise to say "large $n$" and "large $p$" here, since $n$ and $p$ can't themselves be tall and wide respectively?
Line 77/Challenge 1: "Which of these are considered to have high-dimensional data?" Should this be "Which of these scenarios use high-dimensional data?"
Line 289/What statistical methods are used to analyse high-dimensional data?: "In this course we will cover four topics:" Suggest a slight re-wording to make it clear that these are methods for dealing with high-dimensional data: "In this course, we will cover four methods that help in dealing with high-dimensional data:"
Line 322/What statistical methods are used to analyse high-dimensional data?: hyphenation of "high dimensional datasets" (-> "high-dimensional datasets")
Line 348 & 367/Using Bioconductor to access high-dimensional data in the biosciences: minfi is loaded twice
Alt text and captions for figures.

Happy to discuss :)

mallewellyn commented 9 months ago

Converted to task list for tracking. Displayed as task with associated comment below.

[x] 1. Line 53: Add brief few words to foreshadow discussed later/why this is an issue.

Sentence "Such data sets pose a challenge for data analysis as standard methods of analysis, such as linear regression, are no longer appropriate." Possibly a brief few words to foreshadow that this is discussed later or why this is an issue would help to motivate the lesson more clearly.

[x] 2. Lines 41-66: Re-order to applications then challenges. Consider challenges after initial plot.

I think these paragraphs are very informative but perhaps a little cyclical in places since each paragraph starts with a more specific use case of the above and ends with the challenges. It may be clearer (not definitely!) if the applications are first described (getting more descriptive as is) and then the challenges are described to keep the flow of ideas. The challenges could even come after the initial plot of the data.

[x] 3. Line 107: Synonym for 'challenge' in the title

I like this as a heading, but given we first see "Challenge" with reference to problems for the learners to try directly above, the use of "Challenge" again in this title initially confused me a little. Maybe a synonym for challenge would help in this title.

[x] 4. Line 112: Move description of high-dimensional data prevalence to section above and describe overall set of challenges.

Sentences in the paragraph from "This is because, ...". I think these sentences nicely explain why high-dimensional data are so prevalent, but as such I think it should be moved to the section above "What are high-dimensional data". It feels as though the first sentence is starting to explain why methods for high-dimensional data analysis are challenging (don't have as many tools historically) which feels relevant to the section, but the next parts go on to explain why this exists. If moving the explanation for the existence of high-dimensional data, the paragraph could also then pin down what the overall challenge is with high-dimensional data (even if this is a list of the things described later in the lesson like less developed or impossible visualisation) just to clearly motivate what's to come.

[x] 5. Line 140: add comment to complete setup instructions.

Given that this is the first bit of R code in the episode, a comment could be added to make sure people have completed the setup instructions. I think this is probably particularly useful for independent learners. We could also consider extending this to every first R challenge of each episode (I'm thinking that this would be consistent with the Carpentries' consideration for independent learners).

[x] 6. Line 140: Use fewer observations or else clarify the reason for using a low-dimensional data set.

Is there any way we can use a high-dimensional data set here? It feels confusing to talk about high-dimensional data and its challenges and then illustrate those challenges (as stated in the text above) with a data set that is not high-dimensional. I understand the point completely after looking at the problem and solution, but it's not immediately clear to me.

[x] 7. Line 144: New lines for parts a, b and c.
[x] 8. Line 144: Add "What problem with high-dimensional data analysis does this illustrate?" to the end of part c.

Possibly a new line for each part a, b and c of this challenge. Should part c also end with a question like "What problem with high-dimensional data analysis does this illustrate?" to get people thinking?

[x] 9. Line 188: Link or differentiate between issues described.

I think the example used in this section is really clear. However, I'm a little confused if this section is related to the challenge we discussed above re visualising lots of variables being difficult, or if it's a distinct issue. Some text here either differentiating the issues or linking may be useful. If differentiating between the two, titles/subtitles/paragraph titles could also help signpost and make clear.

[x] 10. Line 210: Add "Let's explore why high correlations might be an issue in a Challenge".

This really feels like the biggest problem and is explained below, but maybe more link between the challenge and first text here would help. Something like "Let's explore why high correlations might be an issue in a Challenge".

[x] 11. Line 287: Clarify how challenges relate to what we've just discussed.
[x] 12. Line 287: separate cause and effect in phrasing.

I'm confused how the challenges described in this sentence relate to what we've just discussed. Further, some list the cause of the challenge (e.g. high correlation), while some list the effect (over-fitting). Maybe just summarising the causes first and then effect just to reinforce what was just discussed: difficult to visualise leading to challenges identifying suitable response variables, more features than observations leading to over-fitting, correlations between variables causing challenges including over-fitting (I don't think this latter point is actually explained above but could be useful to explain there. Also there may be more challenges with correlation between variables, hence the wording with that one). Having some sort of change re Line 188 could also help clarify.

[x] 13. Line 287: "is difficult due to:" -> "can be difficult due to"

Also - "can be difficult due to..." rather than "is difficult due to: " may be more accurate since there are other issues we've not considered yet. Also, not sure about this, but is the use of the colon correct in the latter phrase?

Minor comments

[x] 14. Line 53: LaTeX formatting of the inequality of "$p$>=$n$" (-> "$p>=n$")
[x] 15. Line 57: "tall" and "wide" -> "large"

"Subjects like genomics and medical sciences often use both tall (in terms of ) and wide (in terms of ) datasets that can be difficult to analyse or visualise using standard statistical tools." Is it more precise to say "large " and "large" here since n and p can't themselves be tall and wide respectively.

[x] 16. Line 77: "Which of these are considered to have high-dimensional data?" -> "Which of these scenarios use high-dimensional data?"

"Which of these are considered to have high-dimensional data?" Should this be "Which of these scenarios use high-dimensional data?"

[x] 17. Line 289: "In this course we will cover four topics:" -> "In this course, we will cover four methods that help in dealing with high-dimensional data:"

"In this course we will cover four topics:" suggest a slight re-wording to make it clear that these are methods for dealing with high-dimensional data "In this course, we will cover four methods that help in dealing with high-dimensional data:"

[x] 18. Line 322: "high dimensional datasets" -> "high-dimensional datasets"
[x] 19. Line 348 & 367: remove one loading of minfi

minfi is loaded twice

[x] 20. Figure captions

alanocallaghan commented 9 months ago

Two points come to mind from going through the materials:

The definition of high-dimensional data is a bit weird. p = 5,000,000 and n = 5,000,001 is not high-dimensional data...?
Challenge 3 is a bit unclear. Why are the univariate models significant and the multivariable model non-significant? What if you pick another pair of variables, e.g. lcp and svi?

mallewellyn commented 9 months ago

Yes, totally agree. Re the definition of high-dimensional data, it's also not consistent throughout. In my opinion, any reasonable definition is 'ok', as long as it's clear, consistent and the methods are relevant. Maybe simply defining high-dimensional data as having a large number of variables does the job? What do you think?

Happy to string changes to challenge 3 at the end of the open pull request and we can iterate on that one.

mallewellyn commented 9 months ago

[x] 21. clear and consistent definition of high-dimensional data
[x] 22. clarify Challenge 3

alanocallaghan commented 9 months ago

Yeah, I think that definition is fine.

The reason in Challenge 3 is, I guess, just shared attribution of the marginal effects: with both correlated predictors in the model, we don't know which is the true association, so the effect (and significance) is shared between them. Good motivator for regularisation later.
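
A minimal sketch of that shared attribution, assuming simulated data rather than the Prostate data. It is written in plain Python instead of the lesson's R, and every name and number in it is made up for illustration. It fits `y ~ x1` and `y ~ x1 + x2` by least squares, where `x2` is a noisy copy of `x1`, and compares the standard error and t statistic of the `x1` slope:

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Simulated data (an assumption for illustration, not the Prostate data):
# x2 is x1 plus a little noise, so the two predictors are highly correlated,
# and y depends equally on both.
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.3) for a in x1]
y = [0.5 * a + 0.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]


def centred(values):
    """Subtract the mean so the intercept drops out of the normal equations."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


c1, c2, cy = centred(x1), centred(x2), centred(y)
s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
s1y, s2y, syy = dot(c1, cy), dot(c2, cy), dot(cy, cy)

correlation = s12 / math.sqrt(s11 * s22)

# Univariate model y ~ x1: slope, standard error and t statistic.
b1_single = s1y / s11
rss_single = syy - b1_single * s1y
se1_single = math.sqrt(rss_single / (n - 2) / s11)
t1_single = b1_single / se1_single

# Joint model y ~ x1 + x2, solved from the 2x2 normal equations.
det = s11 * s22 - s12 ** 2
b1_joint = (s22 * s1y - s12 * s2y) / det
b2_joint = (s11 * s2y - s12 * s1y) / det
rss_joint = syy - b1_joint * s1y - b2_joint * s2y
se1_joint = math.sqrt(rss_joint / (n - 3) * s22 / det)
t1_joint = b1_joint / se1_joint

print(f"cor(x1, x2)      = {correlation:.2f}")
print(f"y ~ x1:      b1 = {b1_single:.2f}, se = {se1_single:.2f}, t = {t1_single:.1f}")
print(f"y ~ x1 + x2: b1 = {b1_joint:.2f}, se = {se1_joint:.2f}, t = {t1_joint:.1f}")
```

In this setup the standard error of the `x1` slope grows by a factor of roughly `sqrt(1 / (1 - r^2))` once `x2` enters the model, where `r` is the correlation between the predictors. That is why a slope that looks clearly significant on its own can stop looking significant in the joint model.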

alanocallaghan commented 9 months ago

Also, the challenge should say cor > 0.75 to ensure learners pick the same pair.
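
As a rough illustration of why a fixed threshold pins down the pair, here is a small sketch that lists every pair of columns whose Pearson correlation exceeds 0.75. It uses plain Python rather than the lesson's R, and the columns `a`, `b` and `c` are hypothetical toy values, not the Prostate variables:

```python
import math
from itertools import combinations

# Hypothetical toy columns, invented for illustration only.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.1, 2.3, 2.9, 4.2, 4.8, 6.1],
    "c": [3, 1, 4, 1, 5, 2],
}


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    cross = sum(a * b for a, b in zip(du, dv))
    return cross / math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))


threshold = 0.75
high_pairs = [
    (left, right)
    for left, right in combinations(data, 2)
    if pearson(data[left], data[right]) > threshold
]

print(high_pairs)  # prints [('a', 'b')]
```

With a stated threshold, everyone working from the same data ends up with the same list of candidate pairs.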

alanocallaghan commented 9 months ago

Also, this is a super weird model generally. Is Gleason score causing people's age to increase...?

mallewellyn commented 9 months ago

Have added some explanation to the narrative following challenge 3 + changed the correlation threshold in #124

Do you know if there's a description of the Prostate data somewhere? From what I understand, it seems sensible that it would be positively associated with age if this is a general patient data set, but I have no idea based on what's there at the moment.

alanocallaghan commented 9 months ago

Prostate data is from here IIRC: https://www.rdocumentation.org/packages/lasso2/versions/1.2-22/topics/Prostate
Unfortunately there's not more info; it may be better to swap it for a better-documented dataset...

Have you any thoughts on the time estimates on the lesson pages? My feeling is these are all massively underestimated; would be good to get these to be more realistic when moving out of alpha

mallewellyn commented 9 months ago

I definitely don't have the domain knowledge to be able to tell either way from that description, but possibly worth swapping if it's even a bit unclear.

In any case, I think your data page idea would massively help here also! Have opened this in #132 (will obviously try to help create that as much as possible too).

Re the time estimates, I think the overall distribution seems reasonable. I would consider possibly allocating some of the hierarchical clustering time to the factor analysis episode. Andrzej will probably have some really useful insights on this though because he's leading the current round of teaching. I'll reach out to him :)

mallewellyn commented 9 months ago

Episode timings to be reviewed formally in current round of teaching

carpentries-incubator / high-dimensional-stats-r

Review comments: Episode 1 - introduction to high-dimensional data #112