correct way to use DESeq after salmon quantification

tamuanand commented 3 years ago

Hi @rob-p

I have a question on the "right" way of tximport/DESeq2 after salmon quant.

Why I ask "right way" - is because I am a bit confused with the tximport vignette

1 - https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor

Do not manually pass the original gene-level counts to downstream methods without an offset. 
The only case where this would make sense is if there is no length bias to the counts, as happens in 3’ tagged RNA-seq data (see section below). 
The original gene-level counts are in txi$counts when tximport was run with countsFromAbundance="no". 
This is simply passing the summed estimated transcript counts, and does not correct for potential differential isoform usage (the offset), which is the point of the tximport methods (Soneson, Love, and Robinson 2015) for gene-level analysis. 
Passing uncorrected gene-level counts without an offset is not recommended by the tximport package authors. 
The two methods we provide here are: “original counts and offset” or “bias corrected counts without an offset”. 
Passing txi to DESeqDataSetFromTximport as outlined below is correct: the function creates the appropriate offset for you to perform gene-level differential expression.

2 - https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Salmon

files <- file.path(dir, "salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample", 1:6)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)
head(txi.salmon$counts)

Why the confusion - https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor - states

The two methods we provide here are: “original counts and offset” or “bias corrected counts without an offset”. Passing txi to DESeqDataSetFromTximport as outlined below is correct: the function creates the appropriate offset for you to perform gene-level differential expression
The second method is to use the tximport argument countsFromAbundance="lengthScaledTPM" or "scaledTPM", and then to use the gene-level count matrix txi$counts directly as you would a regular count matrix with these software. Let’s call this method “bias corrected counts without an offset”

Now, if I were to use the 2nd bullet as guide, shouldn't txi be generated this way for use with DESeq -- see the addition of countsFromAbundance = "lengthScaledTPM" to tximport line

files <- file.path(dir, "salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample", 1:6)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
head(txi.salmon$counts)
write.csv(as.data.frame(txi.salmon$counts), file = "tx2gene_NumReads.csv")

And then use the tx2gene_NumReads.csv with DESeqDataSetFromMatrix, where the countData comes after reading in tx2gene_NumReads.csv upstream in the code. Note: I am using DESeqDataSetFromMatrix here and not DESeqDataSetFromTximport as I have used tximport with countsFromAbundance=lengthScaledTPM

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds

I also saw these 2 links - https://hbctraining.github.io/DGE_workshop_salmon/lessons/07_DGE_summarizing_workflow.html and https://hbctraining.github.io/DGE_workshop_salmon/lessons/01_DGE_setup_and_overview.html

pseudocounts generated by Salmon are represented as normalized TPM (transcripts per million) counts and map to transcripts. These need to be converted into non-normalized count estimates for performing DESeq2 analysis. To use DESeq2 we also need to collapse our abundance estimates from the transcript level to the gene-level.
note - the 1st link uses DESeqDataSetFromTximport after using tximport to get txi with countsFromAbundance=lengthScaledTPM

Thanks in advance,

rob-p commented 3 years ago

Thanks @tamuanand for the (as always) detailed and clear question! Since this directly involves tximport and DESeq2 downstream, let me also ping @mikelove here.

tamuanand commented 3 years ago

Thanks @rob-p and Thanks in advance @mikelove

The original question pertained to using salmon with say ILMN RNA-Seq followed by DGE with DESeq2

@rob-p - I will also use this opportunity to indulge myself on a related question (how to use salmon with QuantSeq and then downstream with DESeq2). I have asked many QuantSeq related questions on this GH forum and I am yet to find the correct recipe for using salmon with quantseq and downstream DGE

https://github.com/COMBINE-lab/salmon/issues/449#issuecomment-565474848
https://github.com/COMBINE-lab/salmon/issues/365#issuecomment-499732849
and many others (do not want to get into a infinite loop here :)

@rob-p @mikelove - Here is my thought process (for salmon-QuantSeq-DESeq):

I know salmon has the --noLengthCorrection feature and the help text says it is "experimental" for QuantSeq
Probably, I should not use --noLengthCorrection feature when running salmon quant and just get the counts.

One might be wondeing why not to use --noLengthCorrection as it was introduced by @rob-p exclusively for QuantSeq -- that idea is based on what I see on the tximport vignette for 3' tagged RNA-seq which has this to state

If you have 3’ tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, 
because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, 
we recommend to use the original counts, e.g. txi$counts as a counts matrix, e.g. 
providing to DESeqDataSetFromMatrix or to the edgeR or limma functions 
without calculating an offset and without using countsFromAbundance.

Let me know if you would approach the salmon-QuantSeq-DESeq puzzle differently.

Thanks in advance.

mikelove commented 3 years ago

Just to check: do you have length bias in your data (are counts roughly proportional to effective transcript length)?

tamuanand commented 3 years ago

@mikelove - was this question with reference to salmon quant on QuantSeq data?

Just to check: do you have length bias in your data (are counts roughly proportional to effective transcript length)?

@rob-p Is there a way to get the answer to Mike's question from the meta_info.json files. Also, aren't the counts in quant.sf file provided after taking into account length bias and effective transcript length?

This is the salmon quant command line being used for RNA-Seq quantification - still not figured out the right command line combination for QuantSeq data

salmon --no-version-check quant --threads 16 --seqBias --validateMappings --numBootstraps 100 .......

The original question in the post is "what are the correct steps with tximport for running DESEQ after salmon quant"

mikelove commented 3 years ago

The standard is the code chunk in the vignette:

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
# then below...
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

Or even better, you can use tximeta:

se <- tximeta(coldata)
gse <- summarizeToGene(se)
dds <- DESeqDataSet(gse, ~condition)

If you have a special protocol which does not involve fragmentation of a full length transcript, then you do something else. But if you are fragmenting molecules and sequencing from along the entire transcript, use those code chunks from the vignette.

tamuanand commented 3 years ago

Thanks @mikelove

I believe tximeta can be used only for human/mouse? In my case, it is not human/mouse

@rob-p and @mikelove - Based on my reading of the salmon documentation, isn't it that the NumReads/TPM etc made available after lengthCorrection. Extending this, the NumReads in quant.sf corresponds to the estimated count value for each transcript and correlated by effective length. My idea is to therefore use the countsFromAbundance=“lengthScaledTPM” to compute counts that are on the same scale as original counts and not correlated with transcript length across samples.

Given this - Is this below also valid (after salmon quant)

salmon_tx2gene_data = tximport(files, type="salmon", tx2gene=tx2gene,
                                       countsFromAbundance = "lengthScaledTPM")

# generate CSV for archival/use-for-other-purposes 
# then read in the csv and use with DESeq

write.csv(as.data.frame(salmon_tx2gene_data$counts),
                  file = "lengthScaledTPM_tx2gene_counts.csv")

# other code for reading in csv, design_metadata etc

dds <- DESeqDataSetFromMatrix (countData = salmon_tx2gene_data$counts,
                              colData = coldata, ~ condition)

mikelove commented 3 years ago

To keep the code simpler:

txi <- tximport(files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
# then below...
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

DESeq2 will do the right thing based on the value of txi$countsFromAbundance. This is the point of the importer functions. We also have them in tximeta for edgeR and limma.

(You can still use tximeta with organisms other than human, mouse, or fly, you just have to run makeLinkedTxome and point to the GTF for your organism. It's just one step really.)

tamuanand commented 3 years ago

@mikelove

Using your code snippet, what's the subtle difference between using DESeqDataSetFromTximport and DESeqDataSetFromMatrix -- or is it that there is no difference

# DESeqDataSetFromTximport
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

# DESeqDataSetFromMatrix 
dds <- DESeqDataSetFromMatrix (countData = txi$counts,
                              colData = coldata, ~ condition)

Basis of this particular question of mine (with use of DESeqDataSetFromMatrix in the latter code block above):

https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor
The second method is to use the tximport argument countsFromAbundance="lengthScaledTPM" or "scaledTPM", and then to use the gene-level count matrix txi$counts directly as you would a regular count matrix with these software.

mikelove commented 3 years ago

No difference. I only prefer people use ...Tximport() only because we had some people using txi$counts alone and not using the countsFromAbundance argument, and then calling that the "tximport" method, which was making others confused. That's all.

tamuanand commented 3 years ago

Thanks @mikelove

we had some people using txi$counts alone and not using the countsFromAbundance argument

Based on the above, I assume that doing something like this is wrong as DESeqDataSetFromMatrix is being used after countsFromAbundance = "no"

txi = tximport(files, type="salmon", tx2gene=tx2gene,
                                       countsFromAbundance = "no")

dds <- DESeqDataSetFromMatrix (countData = txi$counts,
                              colData = coldata, ~ condition)

@rob-p and @mikelove -- While on this topic, how would you use salmon quant and DESeq2 for QuantSeq data (which would be 3' tagged RNA-seq)? Would you use salmon quant without --noLengthCorrection or would you usesalmon quant with --noLengthCorrection

call salmon quant as before (and do not use --noLengthCorrection) and then do as suggested/stated in these 2 links
- https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor and https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#3%E2%80%99_tagged_RNA-seq
- Do not manually pass the original gene-level counts to downstream methods without an offset. The only case where this would make sense is if there is no length bias to the counts, as happens in 3’ tagged RNA-seq data (see section below). The original gene-level counts are in txi$counts when tximport was run with countsFromAbundance="no".
- If you have 3’ tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, we recommend to use the original counts, e.g. txi$counts as a counts matrix , e.g. providing to DESeqDataSetFromMatrix

mikelove commented 3 years ago

It seems this is unambiguous from the documentation, right? Is there any question left about what to do?

I can’t think how to explain it in sentences that are different from the above.

tamuanand commented 3 years ago

It seems this is unambiguous from the documentation, right? Is there any question left about what to do?

I can’t think how to explain it in sentences that are different from the above.

@mikelove @rob-p My specific question (probably to @rob-p ) was on how to use salmon before I use DESeq2

should I use salmon with --noLengthCorrection and then use txi$counts without offset

iichelhadi commented 3 years ago

This might be unrelated to the topic but I am a bit confused as towhich column values does tximport use from the quant.sf to generate the counts matrix (TPM or NumReads)? Also I had another question after running tximport as in the code chuck below: txi <- tximport(files, countsFromAbundance = 'scaledTPM', type = "salmon", tx2gene = tx2gene )

I found that some genes have 'NA' as values and some are '0'. could someone explain this to me please? thank you in advance.

mikelove commented 3 years ago

TPM is an abundance measure and becomes "abundance" in the list of matrices.

NumReads is an estimated count and becomes "counts" in the list of matrices.

Each software has different names for these, you can browse in the code here:

https://github.com/mikelove/tximport/blob/master/R/tximport.R#L355-L358

iichelhadi commented 3 years ago

TPM is an abundance measure and becomes "abundance" in the list of matrices.

NumReads is an estimated count and becomes "counts" in the list of matrices.

Each software has different names for these, you can browse in the code here:

https://github.com/mikelove/tximport/blob/master/R/tximport.R#L355-L358

thank you

COMBINE-lab / salmon

correct way to use DESeq after salmon quantification #581