2020 Intro to R revamp

jashapiro commented 4 years ago

Overview

Our current training materials for Intro to R start from basic operations, move on to data types and vectors, and slowly build to data frames and graphics. The purpose of this issue is to outline a partial reversal of that ordering: starting with loading real data (as a data frame) and plotting it, then gradually breaking down the high-level operations to reveal some of the power of the tidyverse, and some of the underlying methods that might be useful to researchers as they proceed through their own analyses.

Outline

Show the goal

A paragraph describing the data set we will be using, and a fancy plot of that data, goes at the top. Let them see what they are aiming for over the course of the first module. Brainstorm what steps we need in nontechnical language, e.g., we need to know where the points are, what colors they will be, any calculated statistics, etc. What are the things that software is doing automatically that they are used to taking for granted?

Basic Basics

- Parts of RStudio Interface
- Console
  - Console as calculator (1+1)
  - vectors
    - vector operations: c(1, 2, 3, 4) * 5
  - Variables
    - numbers and strings only
    - difference between a bare word and a quoted string
  - Functions
    - %>% (might save this for later)
- Notebook basics

This list excludes some of the things we currently cover early. I would suggest these be introduced in context, if and as needed. This includes things like str(), NULL, factors, and subsets. We should make sure factors come up in later modules or late in this one. About the others, I am not so sure. Tidyverse verbs will do much of what they need, but some of the logical vector filtering and the like could potentially come up later.
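As a rough sketch of what the console items in the list above could look like in practice (the variable names and values here are made up for illustration, and the pipe is left out since it may be saved for later):

```r
# Console as a calculator
1 + 1

# Vectors and vectorized operations: every element is multiplied by 5
c(1, 2, 3, 4) * 5

# Variables: numbers and strings only for now (hypothetical names and values)
n_samples <- 12
gene_name <- "TP53"

# A bare word is looked up as a variable; a quoted string is literal text
n_samples     # the value stored in the variable
"n_samples"   # just the characters themselves

# Functions take their arguments in parentheses
scaled <- c(1, 2, 3, 4) * 5
mean(scaled)
```

Keeping this to numbers, strings, and one function call matches the idea of introducing str(), NULL, and factors only when they are needed.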

First Data

The first data set should be something that is relatively large, with a number of columns that can be plotted, filtered on, colored, etc. The canonical example is gapminder, but we should have something that is more biologically focused.

One good candidate I had thought of was a summary of some gene expression tests: pairwise comparisons across some sample groupings, possibly a factorial design. So this would be something like gene expression between multiple drug treatments and control samples, or treatment/control × genetic background. The data we would want could have something like the following form:

gene	group1	group2	n1	n2	mean_expr1	mean_expr2	log_fold_change	pvalue

This should be provided as a .csv or .tsv table, which would be read in with readr::read_csv() or equivalent.

We might note here that each column holds a single type of data, without belaboring types yet. The key point to get across is that this is a simple data table, a spreadsheet even, with no fancy features: just rows with the same kinds of data in each, perfectly rectangular (aside from possible missing data).

Brief summaries of columns. Mean, median, etc. (No dplyr::summarize yet, though perhaps something like skimr::skim)
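To make the read-in and summary steps above concrete, here is a minimal sketch. The table contents are invented toy values (not real results), and the file is written to a temporary path only so that the example is self-contained; the real module would read a prepared file instead.

```r
library(readr)

# Hypothetical stand-in for the prepared results table (toy values only)
toy_tsv <- tempfile(fileext = ".tsv")
writeLines(c(
  "gene\tgroup1\tgroup2\tn1\tn2\tmean_expr1\tmean_expr2\tlog_fold_change\tpvalue",
  "geneA\tdrug\tcontrol\t3\t3\t8.1\t6.0\t2.1\t0.0004",
  "geneB\tdrug\tcontrol\t3\t3\t5.0\t6.5\t-1.5\t0.003",
  "geneC\tdrug\tcontrol\t3\t3\t7.2\t7.0\t0.2\t0.6",
  "geneA\tknockout\tcontrol\t3\t3\t7.0\t6.0\t1.0\t0.02"
), toy_tsv)

# Read the tab-separated table; readr guesses one type per column
results_df <- read_tsv(toy_tsv)

# Each column is a single type of data
results_df

# Brief summaries of a column, without dplyr::summarize
mean(results_df$log_fold_change)
median(results_df$log_fold_change)
summary(results_df$pvalue)
```

For a .csv file, readr::read_csv() would take the place of read_tsv().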

First Plot

Scatter (volcano) plot of fold change using ggplot(). The initial plot will use defaults, only plotting fold change vs. -log10(pvalue). All genes from all comparisons will be included. Getting -log10(pvalue) will introduce dplyr::mutate().

This is obviously not what we want, so the next step is to recreate it after using dplyr::filter() to keep just one comparison.
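A sketch of these first two plotting steps, assuming a results table with the columns proposed earlier. The data frame below is a made-up toy table, and the names `neg_log10_p` and `drug_df` are placeholders rather than anything decided in this issue.

```r
library(dplyr)
library(ggplot2)

# Toy results table with invented values, two comparisons against control
results_df <- data.frame(
  gene = rep(c("geneA", "geneB", "geneC"), times = 2),
  group1 = rep(c("drug", "knockout"), each = 3),
  group2 = "control",
  log_fold_change = c(2.1, -1.5, 0.2, 1.0, -0.3, 1.7),
  pvalue = c(0.0004, 0.003, 0.6, 0.02, 0.8, 0.001)
)

# mutate() adds a new column computed from an existing one
results_df <- results_df %>%
  mutate(neg_log10_p = -log10(pvalue))

# First plot: defaults only, all genes from all comparisons
all_plot <- ggplot(results_df, aes(x = log_fold_change, y = neg_log10_p)) +
  geom_point()

# filter() keeps only the rows for one comparison
drug_df <- results_df %>%
  filter(group1 == "drug")

# Same plot, recreated with the filtered data
drug_plot <- ggplot(drug_df, aes(x = log_fold_change, y = neg_log10_p)) +
  geom_point()
```

Printing `all_plot` or `drug_plot` in a notebook chunk displays the figure.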

Add aesthetics: transparency to show overplotting (show how adding this in geom_point affects all equally), color by mean expression level (in aes() to change by value), etc.

Once this single plot is created, use facet_wrap() and the full data set to show how easy it is to make multiples of the same plot for exploration.
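The aesthetics and faceting steps could then be sketched as follows, again on an invented toy table; mapping color to `mean_expr1` is one possible choice, not a settled one.

```r
library(dplyr)
library(ggplot2)

# Toy results table with invented values, two comparisons against control
results_df <- data.frame(
  gene = rep(c("geneA", "geneB", "geneC"), times = 2),
  group1 = rep(c("drug", "knockout"), each = 3),
  group2 = "control",
  mean_expr1 = c(8.1, 5.0, 7.2, 7.0, 6.2, 8.7),
  log_fold_change = c(2.1, -1.5, 0.2, 1.0, -0.3, 1.7),
  pvalue = c(0.0004, 0.003, 0.6, 0.02, 0.8, 0.001)
) %>%
  mutate(neg_log10_p = -log10(pvalue))

drug_df <- filter(results_df, group1 == "drug")

# alpha set in geom_point() applies equally to every point;
# color set inside aes() changes with the value in each row
styled_plot <- ggplot(
  drug_df,
  aes(x = log_fold_change, y = neg_log10_p, color = mean_expr1)
) +
  geom_point(alpha = 0.5)

# facet_wrap() repeats the same plot once per comparison, using the full data
faceted_plot <- ggplot(
  results_df,
  aes(x = log_fold_change, y = neg_log10_p, color = mean_expr1)
) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ group1)
```

Going from the single-comparison plot to the faceted one takes a single extra line, which is the point of showing facet_wrap() at this stage.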

Followup

There is a lot we miss or elide in this quest to get to a pretty figure fast. Take the time now to add in things like variable names, coding style, packages, rm() to get rid of unused large data frames, navigating directories, etc.

We can also dive deeper into tidyverse now. Describe tidy data in more detail, show summaries by group, joining tables, etc.
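For the deeper tidyverse material, summaries by group and table joins might look something like this sketch. Both tables are invented; `gene_info` and its `chromosome` column are hypothetical annotation used only to have something to join.

```r
library(dplyr)

# Toy results table with invented values
results_df <- data.frame(
  gene = rep(c("geneA", "geneB", "geneC"), times = 2),
  group1 = rep(c("drug", "knockout"), each = 3),
  log_fold_change = c(2.1, -1.5, 0.2, 1.0, -0.3, 1.7)
)

# Summaries by group: one row per comparison
group_summary <- results_df %>%
  group_by(group1) %>%
  summarize(
    n_genes = n(),
    median_lfc = median(log_fold_change)
  )

# Hypothetical gene annotation table to join against
gene_info <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  chromosome = c("1", "2", "X")
)

# Joining tables: add the annotation columns to every matching row
annotated_df <- results_df %>%
  left_join(gene_info, by = "gene")
```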

Fill in whatever other gaps are needed...

cansavvy commented 4 years ago

Per our meeting, here are the next steps:

[ ] 1) Find good expression dataset for example and get feedback.
[ ] 2) Post a proposed figure here and get feedback.
[ ] 3) Work backwards from the code for this figure and data and post to this issue a proposed outline of concepts that the intro to R revised module will cover.
[ ] 4) File a series of PRs that follow the concepts outline.
[ ] 5) File some other PRs that introduce concepts "just in time" in the other modules.

Optionally, later...

[ ] Any concepts that are not covered above that we think might still be useful to participants with less R experience can be added to an optional "practice the concepts" notebook that would probably be largely exercise based.

cansavvy commented 4 years ago

Today I'm going to look for a dataset to use for the Intro to R revamp. I'm planning to use refine.bio data unless there's a reason I shouldn't. I think these are some general criteria we want; let me know if there is anything you would add to this list or if any of these priorities should be emphasized less. @jashapiro @jaclyn-taroni

Criteria to look for datasets by (in order of priority):

Has clean experimental design (1 or 2 clear variables to build a model around).
Probably 10-20 samples? (Big enough to do something with, but not so big that we encounter memory problems off the bat.)
Is either microarray or RNA-seq, but the more common the platform, the better.
Needs publication results we can compare against.
Being pediatric cancer relevant is best, but should definitely be cancer related.
Organism should be human and/or mouse. (We don't want them using data from a genome that is not well known; practicing with the genomes they are most likely to use is best.)

jaclyn-taroni commented 4 years ago

I would keep in mind that how we process RNA-seq data in refine.bio is slightly different from what we do in training currently. So if you find a dataset you like, you may want to ask the engineering team if you can get the RDS of the tximport output for that experiment (we should have them around but they are not user facing).

cansavvy commented 4 years ago

I've found some potential datasets that fit the criteria I set above. I'm going to make some example ggplot2 volcano plots (maybe a PCA plot, to check for batch effects?) with them and see which still seem good, and I'll post here which datasets I still think are contenders.

I've asked the dev team to send me tximport files so we can look at the data at that stage.

cansavvy commented 4 years ago

I know we talked about volcano plots, but generally I would just go straight to limma for that, and I don't think taking them straight to a package helps reveal how R/tidyverse works. How do we feel about replicating a simple, one-gene boxplot qPCR figure from a paper? We can do that as a step-by-step ggplot2 alterations tutorial (as is currently being done in the machine-learning module). (I didn't add p values to this, but we could do that.)

Plot I made with microarray data:

The paper's qPCR plot:

Associated paper itself: lambert2013.pdf

jashapiro commented 4 years ago

I don't really like the single gene view for this case. Yes, limma can do the volcano plot, but the point here is to introduce a bit more about ggplot, with the ability to extend it. Limma should be able to output the kind of results table I described, so that is still where I think we should be aiming. The idea here is not to do a full analysis, but to have a good, but slightly complex, tidy data set to work from.

And you know how I feel about boxplots.

jaclyn-taroni commented 4 years ago

I typed this up before @jashapiro's reply showed up for me:

A volcano plot has the benefit of being a scatter plot, so you need to specify x and y in the aesthetic, and I believe @jashapiro is proposing that we color the points (likely by significance?). Our plot choice here, using ggplot2, doesn’t need to be guided by available functionality for making that plot (e.g., limma). I think another benefit of using a table of differential gene expression results is that it is something that will show up again in the RNA-Seq exercises.

cansavvy commented 4 years ago

So if I'm understanding this correctly: 1) Read in dataset and metadata 2) Make model and get results table with limma 3) Walk through some data.frame manipulations with tidyverse? 4) Make volcano plot using ggplot2 layer by layer?

Alternatively, do we want to start them off with a limma results table (aka skip 1), just read in a table we've prepped directly and go straight to data.frame manipulations and ggplot2?

jaclyn-taroni commented 4 years ago

Yes, you want to skip the modeling and getting results and provide them with the results table that they read in.

jashapiro commented 4 years ago

Yup. Skip steps 1 and 2. Start with a well formatted data table. Get right to plotting.

cansavvy commented 4 years ago

How do we feel about this plot @jashapiro? I have a basic set up for taking it step by step to get to this plot: volcano_plot

cansavvy commented 4 years ago

@jashapiro I set up a draft PR #164 so we can discuss this first notebook as quickly as possible. It's nowhere near a finished product; it's just there as a reference so you can see how I'm setting this up so far.

jashapiro commented 4 years ago

That plot looks fine, but the results are kinda bananas. Those tiny p values with small changes are things we may want to filter out from the data set.

cansavvy commented 4 years ago

> That plot looks fine, but the results are kinda bananas. Those tiny p values with small changes are things we may want to filter out from the data set.

I will add a section that can teach about dplyr::filter and plays around with some cutoffs.

jashapiro commented 4 years ago

> I will add a section that can teach about dplyr::filter and plays around with some cutoffs.

I’m hesitant to do that here. I’d rather have the data clean at this stage. Adding a column and filtering to one comparison I think is sufficient for now.

cansavvy commented 4 years ago

> I’m hesitant to do that here. I’d rather have the data clean at this stage. Adding a column and filtering to one comparison I think is sufficient for now.

Perhaps moving forward it would be easiest if we decide what our learning objectives are for this notebook, and then I can construct the notebook around them? Do you have an initial list of the concepts that should be covered, and in what order? This can obviously be an iterative process.

cansavvy commented 4 years ago

Plan moving forward:

1) Borrow ideas from https://github.com/sjspielman/datascience_for_biologists/ 2) Make an outline of a plan based on that. 3) Implement said outline/plan and make the changes to #164

cansavvy commented 4 years ago

@jashapiro, I posted a suggested outline on #165. Feel free to make suggestions or edit directly. I can post it to a Word doc if that is easier for you to make edits on. As soon as you think it is mostly there, I will go ahead and make drafts of each notebook.

cansavvy commented 4 years ago

What is left for Intro to R updates is reflected in https://github.com/AlexsLemonade/exercise-notebook-answers/issues/17, https://github.com/AlexsLemonade/training-modules/issues/175, and https://github.com/AlexsLemonade/training-modules/issues/179

So this issue can be closed!

AlexsLemonade / training-modules

2020 Intro to R revamp #162