COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

log(CPM) and TPM are so different #812

Open biozzq opened 1 year ago

biozzq commented 1 year ago

Dear all,

Salmon is very fast at reporting TPM and read counts for each transcript or gene, and I always use salmon + tximport + edgeR to detect differentially expressed genes. Because edgeR can output normalized read counts (CPM) and tximport can output per-gene TPM from the salmon results, I compared TPM with log2(CPM). In the following correlation plot, the samples cluster by quantification type (TPM vs. CPM) rather than by sample.

Because my RNA-seq experiment has 7 biological replicates in each of two conditions, I have decided to identify differentially expressed genes using the Wilcoxon rank-sum test on each gene's TPM or CPM. I could also retain only the differentially expressed genes that overlap between edgeR and the Wilcoxon rank-sum test. I would like to hear your suggestions.

y <- DGEList(counts=data, group=group, genes=genelength) # the genelength is generated by salmon+tximport for each sample 
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
logcpm <- cpm(y, log=TRUE, prior.count=1)

tpm_cpm_corr-spearman.pdf

Thank you in advance.

Best regards, Zheng zhuqing
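
A note on the snippet above: the `genes=` argument of `DGEList` stores per-gene annotation, so passing lengths there does not by itself make `cpm()` length-aware. The tximport vignette describes converting the length matrix into an edgeR offset instead. Below is a minimal sketch of that recipe. The `txi` list, the `group` factor, and the simulated matrices are hypothetical stand-ins (not the poster's data) so that the block runs on its own; with real data, `txi` would be the object returned by `tximport()`.

```r
library(edgeR)

set.seed(1)
n_genes <- 200
n_samples <- 14

# Hypothetical stand-in for tximport output (counts and per-sample lengths)
cts <- matrix(rnbinom(n_genes * n_samples, mu = 200, size = 5),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("s", seq_len(n_samples))))
len <- matrix(runif(n_genes * n_samples, min = 500, max = 5000),
              nrow = n_genes, dimnames = dimnames(cts))
txi <- list(counts = cts, length = len)
group <- factor(rep(c("A", "B"), each = 7))

# Scale lengths so each gene has geometric mean 1 across samples
normMat <- txi$length
normMat <- normMat / exp(rowMeans(log(normMat)))
normCts <- txi$counts / normMat

# Effective library sizes from the length-scaled counts
eff.lib <- calcNormFactors(normCts) * colSums(normCts)

# Combine length and library-size effects into a log-scale offset
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)

# Counts stay untouched; the offset carries the normalization
y <- DGEList(txi$counts, group = group)
y <- scaleOffset(y, normMat)

design <- model.matrix(~group)
keep <- filterByExpr(y, design)
y <- y[keep, ]
```

After this, `y` keeps the original counts and carries the length and depth correction in `y$offset`, so it can go into the usual edgeR dispersion estimation and testing steps.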

rob-p commented 1 year ago

Pinging @mikelove, who is certainly the person to give the best answer here. One clear difference to note though is that TPM is a length-normalized measure, while CPM is not. This alone means they will exhibit nontrivial differences.

mikelove commented 1 year ago

Yup. I see correlations around .8 which seems reasonable. Imagine:

gene X has count 100 and transcript length 1,000
gene Y has count 100 and transcript length 10,000

These have the same CPM but an order-of-magnitude difference in TPM.
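
The two-gene example can be worked through numerically. This is a toy sketch in base R that assumes the library contains only genes X and Y and uses raw transcript length in place of the effective length that salmon reports; the variable names are illustrative.

```r
# Two genes with equal counts but a tenfold difference in length
counts  <- c(X = 100, Y = 100)
lengths <- c(X = 1000, Y = 10000)  # transcript lengths in bases

# CPM: counts scaled by library size only
cpm <- counts / sum(counts) * 1e6

# TPM: divide by length first, then scale the rates to sum to one million
rate <- counts / lengths
tpm  <- rate / sum(rate) * 1e6

cpm                  # both genes: 500000
tpm                  # X about 909091, Y about 90909
tpm["X"] / tpm["Y"]  # 10
```

Identical counts give identical CPM, while the tenfold length difference appears directly as a tenfold TPM ratio.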

biozzq commented 1 year ago

Dear all,

Thank you for your prompt reply. @mikelove Yes, CPM normalises only across samples, not across genes, whereas TPM normalises across both samples and genes.

Thus, in my mind, TPM is more suitable for downstream RNA-seq analysis, such as clustering and differential expression testing with the Wilcoxon rank-sum test.

Also, to detect differentially expressed genes accurately, is it reasonable to take the overlap of the results from different methods, such as edgeR and the Wilcoxon rank-sum test?

Best regards, Zheng zhuqing

mikelove commented 1 year ago

There is a long literature on why we use counts or CPM (in either case, optionally with an effective transcript length offset) instead of raw TPM for statistical modeling. Using TPM throws out information about the sampling variation. That information can be recovered in large-sample datasets, but in small-sample datasets the loss is too great.
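
To illustrate the point about sampling variation, here is a small simulation in base R. It assumes simple Poisson sampling (a simplification of real sequencing noise), and the two genes, their lengths, and their expected counts are hypothetical. Both genes have the same length-normalized abundance, yet the one backed by few reads is far noisier, which a TPM-like value alone does not reveal.

```r
set.seed(1)

# Two hypothetical genes with the same expected reads per kilobase
length_kb      <- c(A = 0.5, B = 50)
expected_count <- c(A = 5,   B = 500)
expected_count / length_kb  # 10 for both genes

# Simulate repeated sequencing of each gene under Poisson sampling
n_rep <- 10000
sim <- sapply(expected_count, function(mu) rpois(n_rep, mu))

# Length-normalized abundance for every simulated replicate
rate <- sweep(sim, 2, length_kb, "/")

# Coefficient of variation of the length-normalized values
cv <- apply(rate, 2, sd) / colMeans(rate)
cv  # approximately 1/sqrt(5) for A and 1/sqrt(500) for B
```

The counts carry the information that gene A's estimate rests on about 5 reads and gene B's on about 500. Once only the length-normalized value is kept, that difference in precision is no longer visible to the model.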

With respect to Wilcoxon, again, it's good to incorporate the inherent sampling variation of counts into the test statistic, even with nonparametric schemes. This is done in SAMseq (2013):

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4605138/

...and also in our method Swish (2019), which is based on SAMseq but designed specifically for output of methods like Salmon.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6765120/

Note that Swish 1) is nonparametric, 2) takes into account the multinomial-based sampling nature of sequencing data, and 3) takes into account inferential uncertainty from multimapping reads (across isoforms, alleles, or genes).
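
For orientation, a minimal sketch of a Swish run with the fishpond Bioconductor package. It uses the package's simulated dataset from `makeSimSwishData()` so that it runs without any files. With real data, the assumption is that salmon was run with inferential replicates (for example `--numGibbsSamples`) and that the quantifications were imported with tximeta into a SummarizedExperiment with a two-level `condition` column.

```r
library(fishpond)
library(SummarizedExperiment)

set.seed(1)

# Simulated counts with inferential replicates and a 'condition' factor
y <- makeSimSwishData()

# Scale inferential replicates to a common sequencing depth
y <- scaleInfReps(y)

# Flag features with enough counts to be tested
y <- labelKeep(y)

# Two-group comparison on the 'condition' column
y <- swish(y, x = "condition")

# Per-feature results are stored in the row metadata
head(mcols(y)[, c("stat", "log2FC", "pvalue", "qvalue")])
```

The results stay attached to the object, so features can be filtered on `mcols(y)$qvalue` directly.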

biozzq commented 1 year ago

Thank you, @mikelove. I will try Swish, as you suggested.