getzlab / rnaseqc

Fast, efficient RNA-Seq metrics for quality control and process optimization

median TPM of gene differs from median TPM of its singleton transcript in GTEx v8 #48

Closed cb4github closed 4 years ago

cb4github commented 4 years ago

Dear Folks,

I hope all is well, and thanks for all your efforts.

In the transcript expression file, GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_tpm.gct.gz, there is only one transcript, namely ENST00000367976.3, for ENSEMBL gene ENSG00000118523.5 (a.k.a. CTGF), and when I extract the TPM values for said transcript and tissue type 'Artery - Aorta', the resulting median TPM is 935.8 for n=432 (non-zero values).

Correspondingly, in the gene expression file, GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz, when I extract the TPM values for said gene and tissue type 'Artery - Aorta', the resulting median TPM is 2043 for n=432 (non-zero values).
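For reference, the extraction described above can be sketched roughly as follows. This is only an illustration on a toy table, not the script used for the numbers in this post. It assumes the GCT v1.2 layout (a version line, a dimensions line, then a tab-separated header whose first column holds the gene or transcript ID) and that the sample IDs for the tissue were already selected from the sample annotations. The function name `median_nonzero_tpm` and the toy IDs are hypothetical.

```python
import csv
from statistics import median


def median_nonzero_tpm(gct_text, row_id, sample_ids):
    """Return (median, n) over the non-zero TPM values of one row.

    Only the columns named in sample_ids are used.
    """
    lines = gct_text.splitlines()
    # Assumed GCT v1.2 layout: line 1 = version tag, line 2 = dimensions,
    # line 3 = column header, remaining lines = one row per gene/transcript.
    reader = csv.reader(lines[2:], delimiter="\t")
    header = next(reader)
    wanted = set(sample_ids)
    cols = [i for i, name in enumerate(header) if name in wanted]
    for row in reader:
        if row[0] == row_id:
            values = [float(row[i]) for i in cols]
            nonzero = [v for v in values if v > 0]
            return median(nonzero), len(nonzero)
    raise KeyError(row_id)


# Toy table with made-up IDs and values (not GTEx data).
toy = "\n".join([
    "#1.2",
    "2\t4",
    "Name\tDescription\tS1\tS2\tS3\tS4",
    "GENE_A\tA\t10.0\t0.0\t30.0\t20.0",
    "GENE_B\tB\t1.0\t2.0\t3.0\t4.0",
])

result = median_nonzero_tpm(toy, "GENE_A", ["S1", "S2", "S3"])
```

Running the same function on the gene file and on the transcript file with the same sample IDs gives directly comparable medians.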

Also, please see the attached - and apparently quite similar - violin plots (grouped by donor's age bracket) for the gene and transcript TPM values, respectively.

I've looked at the code briefly, but please excuse that I have yet to explain this difference (by a factor of ~2.2) between the median TPM of 2043 for the gene CTGF and 935.8 for the singleton transcript ENST00000367976.3.

Please advise, thanks.

Best,
CB

Attachments: Rplot.CTGF.ArteryAorta.22_10_20.pdf, Rplot.ENST00000367976.ArteryAorta.12_10_20.pdf

francois-a commented 4 years ago

Hi, it's not quite clear how this issue is related to RNA-SeQC. If you're asking about differences between RNA-SeQC gene-level expression and RSEM transcript-level estimates in GTEx, please contact the GTEx Portal.

cb4github commented 4 years ago

From a separate email, it was explained that the RNA-SeQC TPM estimates are based on a simple normalization by gene length, whereas RSEM attempts to correct for additional biases in read coverage. This difference in method can result in relatively large TPM differences for some genes. Many thanks!
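To make the length-normalization point concrete, here is a minimal sketch of the standard TPM formula, where each feature's count is divided by its length and the resulting rates are rescaled to sum to one million. It is not the RNA-SeQC or RSEM code. The counts and both sets of lengths are made-up numbers, and the "adjusted" lengths only stand in for whatever length a bias-aware method might use. RSEM's actual model also reassigns reads, which this sketch ignores.

```python
def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Made-up read counts for three features (illustration only).
counts = [1000, 1000, 8000]

# Hypothetical lengths: plain annotated lengths vs. lengths adjusted by
# some bias-aware model. Only the first feature's length differs.
annotated_lengths = [2000, 1000, 4000]
adjusted_lengths = [4000, 1000, 4000]

tpm_simple = tpm(counts, annotated_lengths)
tpm_adjusted = tpm(counts, adjusted_lengths)

# Ratio of the two TPM estimates for the first feature.
ratio = tpm_simple[0] / tpm_adjusted[0]
```

With identical counts, changing only the length used for one feature shifts its TPM substantially, so a gene and its single transcript can end up with different values when the two pipelines normalize differently.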