MaayanLab / archs4

ARCHS4 RNA-seq processing scripts and web server pages.

Unexpected number of transcripts for mouse #7

Closed apredeus closed 3 years ago

apredeus commented 5 years ago

Hello @lachmann12,

I was wondering why the mouse transcript quantification has only 98,492 rows. The metadata says you used Ensembl v90, which has 131,195 unique transcripts in the GTF file and 109,282 in the cDNA file provided by Ensembl. Was there any additional filtering?

Thank you!
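
The two reference counts mentioned above (unique transcripts in the GTF and in the cDNA FASTA) can be reproduced with a few lines of Python. This is only a minimal sketch: the input lines are toy placeholders, not real Ensembl records, and it assumes that GTF rows carry a `transcript_id "..."` attribute and that FASTA headers start with the transcript ID, optionally followed by a `.version` suffix.

```python
import re

# Matches the transcript_id attribute in the GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(lines):
    """Collect unique transcript IDs from GTF lines, skipping comment lines."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        match = TRANSCRIPT_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(lines):
    """Collect unique IDs from FASTA headers, dropping any version suffix."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # First whitespace-delimited token, e.g. "T1.2" -> "T1"
                ids.add(fields[0].split(".")[0])
    return ids


# Toy input with placeholder IDs (assumption: not real annotation content).
gtf_lines = [
    "#!genome-build example",
    '1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tsrc\ttranscript\t200\t300\t.\t+\t.\tgene_id "G2"; transcript_id "T2";',
    '1\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id "G2";',
]
fasta_lines = [">T1.2 cdna chromosome:example", "ACGT", ">T3.1 cdna", "GGCC"]

gtf_ids = gtf_transcript_ids(gtf_lines)
fasta_ids = fasta_transcript_ids(fasta_lines)
print(len(gtf_ids), len(fasta_ids), sorted(gtf_ids - fasta_ids))
```

For real files, the same functions can be fed an open handle, for example one returned by `gzip.open(path, "rt")` for a compressed GTF or FASTA.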

lachmann12 commented 5 years ago

Hi,

The supported genomes are Ensembl Homo sapiens GRCh38 with the GRCh38.87 annotation file, and Mus musculus GRCm38 with the GRCm38.88 annotation file. We will correct the metadata entry in the next update, which should be coming soon.

Best, Alex

apredeus commented 5 years ago

Hello Alex,

GRCm38.88 means Ensembl version 88, right? The numbers still don't quite add up: there are about 125k unique transcripts there, and about 92k if you select protein-coding transcripts only. That's why I'm asking. It seems some sort of filtering was done on the reference, but it's not clear exactly what kind.

Thanks again!

lachmann12 commented 5 years ago

It might be something the quantification algorithm (kallisto) is doing. I am currently benchmarking multiple aligners, and they do not return the same number of transcripts even though they use the same cDNA file. For the Ensembl 96 annotation, kallisto returns 188,754 transcripts and salmon 177,035. For ARCHS4 we do not apply any specific filtering.

apredeus commented 5 years ago

That's pretty strange. In my experience, kallisto returns precisely the same number of transcripts that were used to build the index. There might be some sort of file format problem: Ensembl often gives the sequences extremely long names.

At any rate, let me know if you find out what the culprit was. Thank you for a great tool.

-- Alex
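
One way to test the claim that kallisto reports exactly the transcripts it was indexed with is to compare the FASTA header names against the `target_id` column of kallisto's `abundance.tsv`. The sketch below uses toy data with placeholder IDs. It assumes each FASTA name is the first whitespace-delimited token of its header, which is how long Ensembl headers are usually shortened; that assumption should be checked against the actual index.

```python
import csv
import io
from collections import Counter


def fasta_names(lines):
    """Return sequence names: the first token of each FASTA header line."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def abundance_target_ids(handle):
    """Read the target_id column from a kallisto abundance.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["target_id"] for row in reader]


# Toy inputs with placeholder IDs (assumption: not real kallisto output).
fasta_lines = [
    ">TX1.1 cdna chromosome:example gene:G1",
    "ACGT",
    ">TX2.1 cdna chromosome:example gene:G2",
    "GGCC",
]
abundance = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "TX1.1\t4\t4\t10\t500000\n"
    "TX2.1\t4\t4\t10\t500000\n"
)

names = fasta_names(fasta_lines)
targets = abundance_target_ids(abundance)

missing = sorted(set(names) - set(targets))  # in the FASTA, absent from output
extra = sorted(set(targets) - set(names))    # in the output, absent from FASTA
duplicates = sorted(n for n, c in Counter(names).items() if c > 1)
print(len(names), len(targets), missing, extra, duplicates)
```

If `missing` and `extra` are both empty and there are no duplicate names, the quantifier is not dropping transcripts, and any discrepancy must come from the reference file itself.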

apredeus commented 5 years ago

Hello @lachmann12,

I'm looking a bit closer into it, and it seems the annotation has some strange problems. First, the list of transcripts used is missing about 5,000 protein-coding transcripts from Ensembl 88, with about 150 protein-coding genes missing completely (e.g. Mrln, ENSMUSG00000019933, has 5 transcripts, but none are in the annotation).

Second, the ARCHS4 annotation has some IDs that are not in any recent Ensembl annotation; e.g. ENSMUST00000019932 was retired as of Ensembl 86, according to this: http://www.ensembl.org/Mus_musculus/Transcript/Idhistory?db=core;t=ENSMUST00000019932

I think it would be good to clarify the methods used to create the kallisto reference.

Thank you for the answers, I appreciate you taking the time and looking into this.
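
The "genes missing completely" check described above can be sketched as follows: group protein-coding transcripts by gene from the GTF, then report genes for which none of the transcripts appear in the quantified set. This is an illustration only. The IDs are placeholders, and it assumes Ensembl-style `gene_id`, `transcript_id` and `transcript_biotype` attributes on `transcript` rows.

```python
import re
from collections import defaultdict

# Parses key "value" pairs from the GTF attribute column.
ATTR = re.compile(r'(\w+) "([^"]*)"')


def transcripts_by_gene(gtf_lines, biotype="protein_coding"):
    """Map gene_id -> set of transcript_ids for transcripts of one biotype."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR.findall(cols[8]))
        if attrs.get("transcript_biotype") != biotype:
            continue
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return genes


def fully_missing_genes(genes, quantified):
    """Genes for which no transcript at all is present in the quantified set."""
    return sorted(g for g, txs in genes.items() if not (txs & quantified))


def row(gene, tx, biotype):
    """Build one toy GTF transcript row (placeholder coordinates)."""
    attrs = f'gene_id "{gene}"; transcript_id "{tx}"; transcript_biotype "{biotype}";'
    return "\t".join(["1", "src", "transcript", "1", "100", ".", "+", ".", attrs])


# Toy annotation with placeholder IDs (assumption: not real Ensembl content).
gtf_lines = [
    row("GENE_A", "TX_A1", "protein_coding"),
    row("GENE_A", "TX_A2", "protein_coding"),
    row("GENE_B", "TX_B1", "protein_coding"),
    row("GENE_C", "TX_C1", "lincRNA"),
]
quantified = {"TX_A1"}  # stand-in for the transcript IDs in the TPM matrix

genes = transcripts_by_gene(gtf_lines)
print(fully_missing_genes(genes, quantified))
```

Here GENE_A is only partially covered, so it is not reported, while GENE_B has no quantified transcript and is listed. GENE_C is ignored because it is not protein-coding.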

itszhengan commented 4 years ago

Hi @apredeus

I recently noticed this too. I compared ARCHS4 with Ensembl v87 and found that the ARCHS4 human transcript-level TPM matrix contains some transcripts that are not in the Ensembl v87 annotation. It's strange, because I tried several reprocessing pipelines, such as GREIN and ARCHS4, and they seem to have the same problem. I'm very doubtful about the accuracy of these data.

Best, Zheng

lachmann12 commented 4 years ago

Hi Zheng,

Thanks for pointing this out. The TPM data is the raw output of the kallisto algorithm. I think the problem lies with the alignment algorithm used; as to why it would remove certain transcripts, I am not sure. It is strange because GREIN uses Salmon, which is very similar to kallisto but developed by a different group. Could you tell me which transcripts are missing?

Best, Alex

itszhengan commented 4 years ago

Hi Alex,

Thanks for your quick reply. I checked human_tpm_v8.h5 from https://amp.pharm.mssm.edu/archs4/download.html, and it has 178,136 transcript names, whereas Ensembl v87 has 197,935 transcripts. I also compared the two sets: 15,141 transcripts appear in your data but not in the Ensembl v87 GTF annotation (such as "ENST00000635399" "ENST00000635416" "ENST00000635517" "ENST00000635007" "ENST00000635114" "ENST00000634476" "ENST00000634687"), and 34,940 transcripts appear in the Ensembl v87 GTF annotation but not in your data (such as "ENST00000635571" "ENST00000631169" "ENST00000625548" "ENST00000422198" "ENST00000451090" "ENST00000425914" "ENST00000585367"). Could you please check? Thank you.

Best, Zheng
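
The comparison described above is a two-way set difference, and the reported counts can be checked for internal consistency: subtracting each "only in" count from its total should give the same number of shared transcripts on both sides. A minimal sketch follows, with placeholder IDs for the set logic. Reading the names out of the HDF5 file is not shown, since the file layout is not described in this thread.

```python
def compare_id_sets(quantified, annotation):
    """Split two collections of IDs into shared and one-sided subsets."""
    quantified, annotation = set(quantified), set(annotation)
    return {
        "shared": quantified & annotation,
        "only_quantified": quantified - annotation,
        "only_annotation": annotation - quantified,
    }


# Toy example with placeholder IDs (assumption: not real transcript IDs).
result = compare_id_sets(["T1", "T2", "T9"], ["T1", "T2", "T3", "T4"])
print({key: sorted(ids) for key, ids in result.items()})

# Consistency check on the counts reported in the comment above.
archs4_total, ensembl_total = 178136, 197935
only_archs4, only_ensembl = 15141, 34940
shared_from_archs4 = archs4_total - only_archs4
shared_from_ensembl = ensembl_total - only_ensembl
print(shared_from_archs4, shared_from_ensembl)
```

Both subtractions give 162,995, so the reported numbers are mutually consistent: the two sets share 162,995 transcript IDs and differ on either side.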

apredeus commented 4 years ago

Hello @lachmann12,

I'm quite certain kallisto does not remove any transcripts, but it would be worth testing. Do you have the initial transcript file that was used to create the kallisto index?

lachmann12 commented 4 years ago

Hi,

I looked into the raw output files, and they show the same difference you describe. I should have the exact data files on my work computer, but it is in a New York hospital and the lab is currently working remotely.

itszhengan commented 4 years ago

Hi,

So you mean the current data is wrong? If so, please update the data and the reference genome version. Thank you.

Best, Zheng

On April 25, 2020, at 01:23, Alexander Lachmann notifications@github.com wrote:

Hi,

I looked into the raw output files, and they show the same difference you describe. I should have the exact data files on my work computer, but it is in a New York hospital and the lab is currently working remotely.
