Open TomSmithCGAT opened 2 years ago
I'm aware that introducing DESeq2DataSet
at this point might be considered too early and over-complicated. Would be simple enough to do all the above just from tximport
output obviously. Just a few more lines for the single gene plotting and keep 'ggfortify' for the PCA.
I like the idea of introducing the DESeqDataSet
early on, since that's the main data structure to become familiar with.
I don't think it's necessary to interact with the txi
object directly.
I like this kind of approach:
assay(dds, "rlog") <- assay(rlog(dds))
To include the normalised counts within the main object. So that they can saveRDS()
the object for other downstream analysis later on.
This way you could do the exploration by fetching the matrices as assay(dds, "counts")
for raw counts or assay(dds, "rlog")
for normalised counts.
For example, this is what the popular workflow nf-core/rnaseq
does, creating a DESeqDataSet
object with those assays (and a null design design(dds) ~ 1
), so if users or their bioinfo core use that workflow for their analysis, they will then be able to follow on from our course and apply all the downstream analysis we cover.
Incidentally, doing the rlog
from the DESeqDataSet
object takes into account transcript length normalisation factors (generated by salmon and stored in assay(dds, "avgTxLength")
from objects created by DESeqDataSetFromTximport()
- see estimateSizeFactors()
docs), whereas the way we are doing it now (directly from the txi$counts
matrix) does not.
Exploring the expression of a single gene seems very useful too. It maybe requires a bit of gymnastics, but probably worth covering. I would usually do something like:
assay(dds[c("gene1", "gene2"), ], "rlog") |>
as_tibble(rownames = "gene") |>
pivot_longer(-gene, names_to = "sample", values_to = "expr") |>
left_join(as_tibble(colData(dds)), by = "sample") |>
ggplot(aes(treatment, expr)) +
geom_jitter() +
facet_wrap(~ gene)
Which can probably be broken into simpler steps for teaching. Or maybe there's a simpler way of doing this?
Alot of what you suggest is actually covered in the later sections (deseq2 objects/hierarchical clustering/heatmaps etc.). In fact a huge chunk at the start of the deseq2 section is devoted to the parts that make up the deseq2 object and how it works. I am open to discussing maybe moving some of it earlier though if people think it would be more useful but my feeling is that this section is for the little qc/explorations you do before you get to deseq2. There was a section of the visualisation section which deals with looking at a single gene but so much had to be removed to make it fit into teaching remotely unfortunately it got cut but we have it in supplementary if we wanted to bring it back. I worry that diving into their favourite gene too early might increase the likelihood of cherrypicking outcomes but I guess if there is a control gene they definitely know should do something...? Food for thought...
I do take your point on transformations, although I didn't feel like I spent much time on it at all last time I taught that bit lol! Maybe we could rationalise the materials here. Personally I have also been thinking that this section needs a refresh but I didn't have time between Feb when I taught it last and now.
I don't think its an issue packages get introduced just for one task (although incidentally I think those are also in later sections) as thats the nature of how things work in the real world and this isn't a R beginners course.
I'll set up a debrief meeting so we can get everyone's feedback on any changes we want to make over the summer break but if you have anymore thoughts keep adding them here/as issues its good to have a record.
I think hierachical clustering and sample correlations (both in the sup materials for this notebook), would complement the PCA, and give the participants another angle for the data exploration to understand if the experiment has 'worked', which seems like it should be the major aim of this section to me.
Whether to do this from a DESeqDataSet
object though? :thinking: I can definitely both sides on that one.
I would only suggest adding in the single gene example if we want to demonstrate the effect of the transformation more clearly, since I think many participants will find that helpful. Agree that we might not want them to focus on a single gene at this stage though.
The 'RNA-seq Data Exploration' notebook covers:
vst
rlog
(exercise)My concerns are that
ggfortify
,ggrepel
)vst
andrlog
).I would favour the following, with some of the below potentially being recovered from the 'Additional Data Exploration' notebook.
DESeq2DataSet
, including exploring data structure (with explanation that thedesign
argument will be fully explored in a later session)DESeq2::plotCounts
)pheatmap
should be suffficient, as suggested in DESeq2 vignette, or alternatively, just base Rhclust
, with appropriate sample naming)rlog
DESeq2::plotPCA