Open biozzq opened 1 year ago
Pinging @mikelove, who is certainly the person to give the best answer here. One clear difference to note though is that TPM is a length-normalized measure, while CPM is not. This alone means they will exhibit nontrivial differences.
Yup. I see correlations around 0.8, which seems reasonable. Imagine:
gene X has count 100 and transcript length 1,000
gene Y has count 100 and transcript length 10,000
These have the same CPM but an order-of-magnitude difference in TPM.
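The gene X / gene Y example can be checked with a few lines of Python. This is only a minimal sketch: it assumes a toy library containing just these two genes, uses the plain transcript length rather than an effective length, and the helper names `cpm` and `tpm` are hypothetical (they are not the edgeR or tximport functions).

```python
def cpm(counts):
    """Counts per million: scale each count by the total library size."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths):
    """Transcripts per million: divide by length first, then rescale to 1e6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: r / total_rate * 1e6 for gene, r in rates.items()}


# Toy library (assumption): only gene X and gene Y are present.
counts = {"X": 100, "Y": 100}
lengths = {"X": 1_000, "Y": 10_000}

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths)

print("CPM:", cpm_values)  # identical for X and Y
print("TPM:", tpm_values)  # X is about ten times Y
```

Because CPM ignores length, both genes get the same value; TPM divides by length before rescaling, so the shorter gene X ends up about ten times higher.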
Dear all
Thank you for your prompt reply. @mikelove yes, CPM normalises only across samples, not across genes, whereas TPM normalises across both samples and genes.
Thus, in my mind, TPM is more suitable for downstream RNA-seq analysis, such as clustering and differential expression testing with the Wilcoxon rank-sum test.
Also, to detect differentially expressed genes accurately, is it reasonable to take the overlap of the results from different methods, such as edgeR and the Wilcoxon rank-sum test?
Best regards, Zheng zhuqing
There is a long literature about why we use counts or CPM (in either case, optionally with an effective transcript length offset) instead of raw TPM for statistical modeling. Using TPM throws out information about the sampling variation. That information can be recovered in large-sample datasets, but in small-sample datasets the information loss is too great.
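As a rough illustration of the sampling-variation point, the sketch below compares two hypothetical samples in which a gene has the same per-million abundance but very different underlying counts. It assumes a simple Poisson approximation to the sampling noise, under which the relative error of a count is 1/sqrt(count); the numbers and the helper names `per_million` and `poisson_cv` are made up for illustration.

```python
import math


def per_million(count, library_size):
    """Scale a count to a per-million abundance (as CPM does)."""
    return count / library_size * 1e6


def poisson_cv(count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(count)


# Hypothetical samples: same relative abundance, 100-fold different depth.
shallow = {"count": 5, "library_size": 1_000_000}
deep = {"count": 500, "library_size": 100_000_000}

abundance_shallow = per_million(**shallow)
abundance_deep = per_million(**deep)
cv_shallow = poisson_cv(shallow["count"])
cv_deep = poisson_cv(deep["count"])

print(f"shallow: abundance={abundance_shallow:.2f}, relative error ~ {cv_shallow:.3f}")
print(f"deep:    abundance={abundance_deep:.2f}, relative error ~ {cv_deep:.3f}")
```

The two scaled abundances are indistinguishable, yet under this assumption the shallow sample's value is about ten times noisier. A scaled value on its own does not carry that distinction, whereas the count does.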
With respect to Wilcoxon, again, it's good to incorporate the inherent sampling variation of counts into the test statistic even with nonparametric schemes. This occurs in SAMseq (2013)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4605138/
...and also in our method Swish (2019), which is based on SAMseq but designed specifically for output of methods like Salmon.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6765120/
Note that Swish 1) is nonparametric, 2) takes into account the multinomial-based sampling nature of sequencing data, and 3) also takes into account inferential uncertainty from multimapping reads (across isoforms, alleles, or genes).
Thank you @mikelove. I will try Swish, which you mentioned above.
Dear all,
I would say that Salmon is very fast at reporting the TPM and read counts for each transcript or gene, and I always use Salmon + tximport + edgeR to detect differentially expressed genes. Because edgeR can output normalized read counts and tximport can output TPM for each gene from the results generated by Salmon, I would like to ask about the difference between TPM and log2(CPM). In the following correlation plot, the samples cluster by quantification type (TPM or CPM) rather than by sample. Because my RNA-seq experiment contains 7 biological replicates in each of two conditions, I decided to identify differentially expressed genes using the Wilcoxon rank-sum test on each gene's TPM or CPM. I could also retain only the differentially expressed genes that overlap between edgeR and the Wilcoxon rank-sum test. I would like to hear your suggestions.
tpm_cpm_corr-spearman.pdf
Thank you in advance.
Best regards, Zheng zhuqing