Closed jashapiro closed 4 years ago
Today I'm going to look for a dataset to use for the Intro to R revamp. I'm planning to use refine.bio data unless there's a reason I shouldn't. I think these are some general criteria we want; let me know if there is anything you would add to this list or if any of these priorities should be emphasized less. @jashapiro @jaclyn-taroni
I would keep in mind that how we process RNA-seq data in refine.bio is slightly different from what we do in training currently. So if you find a dataset you like, you may want to ask the engineering team if you can get the RDS of the tximport output for that experiment (we should have them around but they are not user facing).
I've found some potential datasets that fit the criteria I set above. I'm going to make some example ggplot2 volcano plots with them (maybe a PCA plot too, to check for batch effects?) and see which still seem good, then post the datasets I still think are contenders here.
I've asked the dev team to send me tximport files so we can look at the data at that stage.
I know we talked volcano plots, but generally I would just go straight to limma for that and I don't think taking them straight to a package helps reveal how R/tidyverse works. How do we feel about making a replicate of a simple, one gene boxplot qPCR figure from a paper? We can do that step by step ggplot2 alterations tutorial type thing (that is currently being done in the machine-learning module). (I didn't add p values to this but we could do that)
Plot I made with microarray data:
The paper's qPCR plot:
Associated paper itself: lambert2013.pdf
I don't really like the single gene view for this case. Yes, limma can do the volcano plot, but the point here is to introduce a bit more about ggplot, with the ability to extend it. Limma should be able to output the kind of results table I described, so that is still where I think we should be aiming. The idea here is not to do a full analysis, but to have a good, but slightly complex, tidy data set to work from.
And you know how I feel about boxplots.
I typed this up before @jashapiro's reply showed up for me:
A volcano plot has the benefit of being a scatter plot, so you need to specify x and y in the aesthetic, and I believe @jashapiro is proposing that we color the points (likely by significance?). Our plot choice here, using ggplot2, doesn’t need to be guided by available functionality for making that plot (e.g., limma). I think another benefit of using a table of differential gene expression results is that it is something that will show up again in RNA-Seq Exercises.
So if I'm understanding this correctly: 1) Read in dataset and metadata 2) Make model and get results table with limma 3) Walk through some data.frame manipulations with tidyverse? 4) Make volcano plot using ggplot2 layer by layer?
Alternatively, do we want to start them off with a limma results table (aka skip 1), just read in a table we've prepped directly and go straight to data.frame manipulations and ggplot2?
Yes, you want to skip the modeling and getting results and provide them with the results table that they read in.
Yup. Skip steps 1 and 2. Start with a well formatted data table. Get right to plotting.
How do we feel about this plot @jashapiro? I have a basic set up for taking it step by step to get to this plot:
@jashapiro I set up a draft PR #164 so we can discuss this first notebook as quickly as possible. It's nowhere near a finished product; it's just there as a reference so you can see how I'm setting this up so far.
That plot looks fine, but the results are kinda bananas. Those tiny p values with small changes are things we may want to filter out from the data set.
I will add a section that can teach about dplyr::filter and plays around with some cutoffs.
I’m hesitant to do that here. I’d rather have the data clean at this stage. Adding a column and filtering to one comparison I think is sufficient for now.
Perhaps moving forward it would be easiest if we decide what our learning objectives are for this notebook, and then I can construct the notebook around them? Do you have an initial list of the concepts that should be covered, and in what order? This can obviously be an iterative process.
1) Borrow ideas from https://github.com/sjspielman/datascience_for_biologists/ 2) Make an outline of a plan based on that. 3) Implement said outline/plan and make the changes to #164
@jashapiro, I posted a suggested outline on #165. Feel free to make suggestions or edit directly. I can post it to a Word doc if that is easier for you to make edits on. As soon as you think it is mostly there, I will go ahead and make drafts of each notebook.
What is left for Intro to R updates is reflected in https://github.com/AlexsLemonade/exercise-notebook-answers/issues/17, https://github.com/AlexsLemonade/training-modules/issues/175, and https://github.com/AlexsLemonade/training-modules/issues/179
So this issue can be closed!
Overview
Our current training materials for Intro to R start with basic operations, move on to data types and vectors, and slowly build to data frames and graphics. The purpose of this issue is to outline a partial reversal of that ordering: starting with loading real data (as a data frame) and plotting it, then gradually breaking down the high-level operations to reveal some of the power of the tidyverse, and some of the underlying methods that might be useful to researchers as they proceed through their own analyses.
Outline
Show the goal
A paragraph describing the data set we will be using, and a fancy plot of that data, goes at the top. Let them see what they are aiming for over the course of the first module. Brainstorm what steps we need in nontechnical language, e.g., we need to know where the points go, what colors they will be, which statistics get calculated, etc. What are the things that software is doing automatically that they are used to taking for granted?
Basic Basics
1+1
c(1, 2, 3, 4) * 5
%>% (might save this for later)
This list excludes some of the things we currently cover early. I would suggest these should be introduced in context, if/as needed. This includes things like str(), NULL, factors, and subsets. Factors we should make sure come up in later modules or late in this one. Other things, I am not so sure. Tidyverse verbs will do much of what they need, but some of the logical vector filtering and the like could potentially come up later.
First Data
The first data set should be something that is relatively large, with a number of columns that can be plotted, filtered on, colored, etc. The canonical example is gapminder, but we should have something that is more biologically focused.
One good candidate I had thought of was a summary of some gene expression tests: pairwise comparisons across some sample groupings, possibly a factorial design. So this would be something like gene expression between multiple drug treatments and control samples, or treatment/control X genetic background. The data we would want could have something like the following form:
This should be provided as a .csv or .tsv table, which would be read in with readr::read_csv() or equivalent.
We might note at this point that each column is a single type of data, but still not belabor types at this point. The key point to get across is that this is a simple data table, spreadsheet even, with no fancy features, just rows with the same kinds of data in each, and they are perfectly rectangular (aside from possible missing data).
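As a rough sketch of what that first read-in step might look like: the column names and values below are made up for illustration (they are not the actual training data set), and the table is written to a temporary file only so the example is self-contained.

```r
library(readr)

# Hypothetical results table: column names and values are illustrative
# assumptions, not the real data set for the module.
example_lines <- c(
  "gene,comparison,log_fold_change,mean_expression,p_value",
  "GENE1,drugA_vs_control,1.8,7.2,0.0004",
  "GENE2,drugA_vs_control,-0.3,5.1,0.62",
  "GENE1,drugB_vs_control,0.9,7.2,0.03",
  "GENE2,drugB_vs_control,-2.1,5.1,0.00002"
)

# Write the toy table to a temporary file so there is something to read
results_file <- tempfile(fileext = ".csv")
writeLines(example_lines, results_file)

# Read the table; readr guesses one type per column
results_df <- readr::read_csv(results_file)

# Printing shows the rectangular shape and the type of each column
results_df
```

Printing the tibble shows the column types under the column names, which could be enough to make the "one type per column" point without a detour into types.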
Brief summaries of columns. Mean, median, etc. (No dplyr::summarize yet, though perhaps something like skimr::skim)
First Plot
Scatter (volcano) plot of fold change using ggplot(). The initial plot will use defaults, only plotting fold change vs. -log10(pvalue). All genes from all comparisons will be included. Getting -log10(pvalue) will introduce dplyr::mutate().
This is obviously not what we want, so the next step is to recreate the plot after using dplyr::filter to keep just one comparison.
Add aesthetics: transparency to show overplotting (show how adding this in geom_point affects all points equally), color by mean expression level (in aes() to change by value), etc.
Once this single plot is created, use facet_wrap() and the full data set to show how easy it is to make multiples of the same plot for exploration.
Followup
There is a lot we miss or elide in this quest to get to a pretty figure fast. Take the time now to add in things like variable names, coding style, packages, rm() to get rid of unused large data frames, navigating directories, etc.
We can also dive deeper into tidyverse now. Describe tidy data in more detail, show summaries by group, joining tables, etc.
Fill in whatever other gaps are needed...
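Looking back at the First Plot steps in the outline, the progression might be sketched roughly as follows. This is only an illustration under assumptions: the toy table, its column names (comparison, log_fold_change, mean_expression, p_value), and the comparison labels are hypothetical stand-ins for whatever results table we end up providing.

```r
library(dplyr)
library(ggplot2)

# Hypothetical results table: column names and values are illustrative only
results_df <- dplyr::tibble(
  gene = rep(paste0("GENE", 1:6), times = 2),
  comparison = rep(c("drugA_vs_control", "drugB_vs_control"), each = 6),
  log_fold_change = c(2.1, -1.7, 0.2, -0.4, 1.2, -2.5,
                      0.8, -0.1, 1.9, -2.2, 0.3, -0.9),
  mean_expression = rep(c(7.2, 5.1, 9.4, 3.3, 6.0, 8.1), times = 2),
  p_value = c(1e-5, 2e-4, 0.7, 0.4, 0.01, 1e-6,
              0.05, 0.9, 3e-4, 1e-5, 0.6, 0.02)
)

# dplyr::mutate() adds the -log10(p value) column used for the y-axis
results_df <- results_df %>%
  dplyr::mutate(neg_log10_p = -log10(p_value))

# dplyr::filter() keeps the rows for a single comparison
one_comparison_df <- results_df %>%
  dplyr::filter(comparison == "drugA_vs_control")

# alpha is set outside aes(), so it applies equally to every point;
# color is mapped inside aes(), so it changes with the data
volcano_plot <- ggplot(
  one_comparison_df,
  aes(x = log_fold_change, y = neg_log10_p, color = mean_expression)
) +
  geom_point(alpha = 0.5)

# The same layers on the full table, with one panel per comparison
faceted_plot <- ggplot(
  results_df,
  aes(x = log_fold_change, y = neg_log10_p, color = mean_expression)
) +
  geom_point(alpha = 0.5) +
  facet_wrap(~comparison)

volcano_plot
faceted_plot
```

Keeping the single-comparison plot and the faceted plot as two objects built from identical layers may help make the point that facet_wrap() is one extra line rather than a new kind of plot.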