Too much focus on transformation in the RNA-Seq exploration section

TomSmithCGAT commented 2 years ago

The 'RNA-seq Data Exploration' notebook covers:

Reading in the salmon quant and sample info, including exploring data structure
Importing scaled TPM (excercise)
Filtering the genes
Raw count distributions
log2 transformation
vst
rlog (exercise)
PCA
Plotting different PCs (exercise)
Reveal sample swap and update sample info

My concerns are that

Much of the material is focused on data transformation, but ultimately, only the rlog is used, and the only exploration of the data is the PCA.
At no point do we explore the expression of a single gene, which I expect will be a more intuitive way to explore the data for many participants
2 extra packages are introduced just for the PCA (ggfortify, ggrepel)
The introduction beforehand focuses on DESeq2, but they don't use this package and explore the object structure until much later (excepting vst and rlog).

I would favour the following, with some of the below potentially being recovered from the 'Additional Data Exploration' notebook.

Reading in the salmon quant and sample info
Creating a DESeq2DataSet, including exploring data structure (with explanation that the design argument will be fully explored in a later session)
Exploring raw counts by looking at the overall distibution and one example gene (DESeq2::plotCounts)
Repeat 3 but for log transformed counts (exercise)
Hierachical clustering of raw counts (a few lines with pheatmap should be suffficient, as suggested in DESeq2 vignette, or alternatively, just base R hclust, with appropriate sample naming)
Repeat 6 with log-transformed counts to demonstrate the value of transformation (exercise; separation should now be clearer)
Reveal sample swap and update sample info
rlog
PCA using DESeq2::plotPCA
What does the PCA 'say' (discussion)

TomSmithCGAT commented 2 years ago

I'm aware that introducing DESeq2DataSet at this point might be considered too early and over-complicated. Would be simple enough to do all the above just from tximport output obviously. Just a few more lines for the single gene plotting and keep 'ggfortify' for the PCA.

tavareshugo commented 2 years ago

I like the idea of introducing the DESeqDataSet early on, since that's the main data structure to become familiar with. I don't think it's necessary to interact with the txi object directly.

I like this kind of approach:

assay(dds, "rlog") <- assay(rlog(dds))

To include the normalised counts within the main object. So that they can saveRDS() the object for other downstream analysis later on. This way you could do the exploration by fetching the matrices as assay(dds, "counts") for raw counts or assay(dds, "rlog") for normalised counts.

For example, this is what the popular workflow nf-core/rnaseq does, creating a DESeqDataSet object with those assays (and a null design design(dds) ~ 1), so if users or their bioinfo core use that workflow for their analysis, they will then be able to follow on from our course and apply all the downstream analysis we cover.

Incidentally, doing the rlog from the DESeqDataSet object takes into account transcript length normalisation factors (generated by salmon and stored in assay(dds, "avgTxLength") from objects created by DESeqDataSetFromTximport() - see estimateSizeFactors() docs), whereas the way we are doing it now (directly from the txi$counts matrix) does not.

Exploring the expression of a single gene seems very useful too. It maybe requires a bit of gymnastics, but probably worth covering. I would usually do something like:

assay(dds[c("gene1", "gene2"), ], "rlog") |> 
  as_tibble(rownames = "gene") |> 
  pivot_longer(-gene, names_to = "sample", values_to = "expr") |> 
  left_join(as_tibble(colData(dds)), by = "sample") |> 
  ggplot(aes(treatment, expr)) +
  geom_jitter() +
  facet_wrap(~ gene)

Which can probably be broken into simpler steps for teaching. Or maybe there's a simpler way of doing this?

AbbiEdwards commented 2 years ago

Alot of what you suggest is actually covered in the later sections (deseq2 objects/hierarchical clustering/heatmaps etc.). In fact a huge chunk at the start of the deseq2 section is devoted to the parts that make up the deseq2 object and how it works. I am open to discussing maybe moving some of it earlier though if people think it would be more useful but my feeling is that this section is for the little qc/explorations you do before you get to deseq2. There was a section of the visualisation section which deals with looking at a single gene but so much had to be removed to make it fit into teaching remotely unfortunately it got cut but we have it in supplementary if we wanted to bring it back. I worry that diving into their favourite gene too early might increase the likelihood of cherrypicking outcomes but I guess if there is a control gene they definitely know should do something...? Food for thought...

I do take your point on transformations, although I didn't feel like I spent much time on it at all last time I taught that bit lol! Maybe we could rationalise the materials here. Personally I have also been thinking that this section needs a refresh but I didn't have time between Feb when I taught it last and now.

I don't think its an issue packages get introduced just for one task (although incidentally I think those are also in later sections) as thats the nature of how things work in the real world and this isn't a R beginners course.

I'll set up a debrief meeting so we can get everyone's feedback on any changes we want to make over the summer break but if you have anymore thoughts keep adding them here/as issues its good to have a record.

TomSmithCGAT commented 2 years ago

I think hierachical clustering and sample correlations (both in the sup materials for this notebook), would complement the PCA, and give the participants another angle for the data exploration to understand if the experiment has 'worked', which seems like it should be the major aim of this section to me.

Whether to do this from a DESeqDataSet object though? :thinking: I can definitely both sides on that one.

I would only suggest adding in the single gene example if we want to demonstrate the effect of the transformation more clearly, since I think many participants will find that helpful. Agree that we might not want them to focus on a single gene at this stage though.

bioinformatics-core-shared-training / Bulk_RNAseq_Course_Base

Too much focus on transformation in the RNA-Seq exploration section #10