eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Question: Depth of GO annotations #389

Open marchoeppner opened 2 years ago

marchoeppner commented 2 years ago

Hi,

I have been trying to incorporate Eggnog Mapper into a genome annotation pipeline and find that, when using the "Metazoan" reference database on an annotation of a fish genome, I sometimes get hundreds of GO terms attached to a given mRNA. The model looks "sane", and the orthologous gene in another fish species in EnsEMBL has 5 or 6 GO terms.

So I suspect that what I am seeing is somehow "wrong", or maybe Eggnog Mapper attaches the entire GO graph above each "tip" term to each mRNA model? I would have expected a handful of terms at most. The documentation does not elaborate on this, as far as I can tell. Maybe someone could clarify how this is supposed to work...

Many thanks, Marc

Cantalapiedra commented 2 years ago

Hi @marchoeppner ,

thank you for reporting this. Could you provide the specific example with the GO terms and those you obtain from Ensembl?

Best, Carlos

marchoeppner commented 2 years ago

Certainly.

I am just randomly picking one example where the list of GO terms seems excessively long.

This is the "decorated" mRNA for the locus shown below:

ptg000007l EVM mRNA 20166438 20178559 . + . ID=evm.model.ptg000007l.513;Parent=evm.TU.ptg000007l.513;Name=TTC39C;em_target=144197.XP_008275783.1;em_score=1132.0;em_evalue=0.0;em_tcov=100.0;em_OGs=KOG3783@1|root,KOG3783@2759|Eukaryota,38B40@33154|Opisthokonta,3BFI6@33208|Metazoa,3CV0M@33213|Bilateria,489B0@7711|Chordata,490YD@7742|Vertebrata,49ZC9@7898|Actinopterygii;em_COG_cat=S;em_desc=Tetratricopeptide repeat domain 39C;em_max_annot_lvl=33208|Metazoa;em_Preferred_name=TTC39C;em_PFAMs=DUF3808;Ontology_term=GO:0006996,GO:0007275,GO:0007368,GO:0007389,GO:0007423,GO:0007507,GO:0008150,GO:0009653,GO:0009790,GO:0009799,GO:0009855,GO:0009887,GO:0009987,GO:0016043,GO:0022607,GO:0030030,GO:0030031,GO:0032474,GO:0032501,GO:0032502,GO:0042471,GO:0042472,GO:0043583,GO:0044085,GO:0044782,GO:0048513,GO:0048562,GO:0048568,GO:0048598,GO:0048731,GO:0048839,GO:0048840,GO:0048856,GO:0060271,GO:0061371,GO:0070925,GO:0071840,GO:0072359,GO:0090596,GO:0120031,GO:0120036
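To put a number on "excessively long", the GO IDs can be pulled out of the ninth GFF3 column. This is a minimal sketch, assuming the columns are tab-separated, attributes are separated by `;`, and multi-valued attributes by `,` as in the record above. The helper name `go_terms_from_gff_line` and the shortened example record are illustrative only.

```python
def go_terms_from_gff_line(line: str) -> list[str]:
    """Return the GO IDs listed in the Ontology_term attribute of one GFF3 line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return []  # not a feature line (comment, directive, or malformed)
    attributes = {}
    for pair in columns[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    terms = attributes.get("Ontology_term", "")
    return [t for t in terms.split(",") if t]


# Shortened, tab-separated version of the record above (illustrative only).
example = "\t".join([
    "ptg000007l", "EVM", "mRNA", "20166438", "20178559", ".", "+", ".",
    "ID=evm.model.ptg000007l.513;Name=TTC39C;"
    "Ontology_term=GO:0006996,GO:0007275,GO:0008150",
])
print(len(go_terms_from_gff_line(example)))  # 3
```

Applied to every mRNA line of an annotation file, the same function gives a per-model term count that can be compared against other annotation sources.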

This is the orthologous gene from a fish species in EnsEMBL (4 GO terms in total):

http://www.ensembl.org/Amphiprion_percula/Gene/Summary?db=core;g=ENSAPEG00000021475;r=2:10904441-10931910;t=ENSAPET00000031054

And this is the locus in question in WebApollo:

[Screenshot of the locus in WebApollo, 2022-06-01]

Protein sequence for this mRNA:

>evm.model.ptg000007l.513
MAGPEQSQQQQQVEEKAEHIDDAEMALQGINMLLNNGFKESDELFRRYRTQSPLMSFGASFVSFLNAMMT
FEEEKMQTACDDLRTTEKLCESDSAGVIETIRNKIKKSMDSQRSGVVVIDRLQRQIIVADCQVYLAVLSF
VKQELSAYIKGGWILRKAWKMYNKCHSDISQLQESCQRRSSGNQESLSADNANHNAPVENAVTAEALDRL
KGSVSFGYGLFHLCISMVPPHLLKIINLLGFPGDRLQGLSSLMYASESKDMKAPLATLALLWYHTVVLPF
FALDGSDTHEGLLEAKAILQRKSVVYPNSSLFMFFKGRVQRLECHINSALACFHDALELASDQREIQHVC
LYEIGWCSMIEMNFEDAYRAFERLKNESRWSQCYYAYLTGVCQGAAGDLDGASGVFKDVQKLFKRKNNQI
EQFAVKRAERLRKISPTRELCILGVIEVLYLWKALPNCSSSKLQIMNQVLQSLDEASCRGLKHLLLGAIH
KCHGNVRDALQSFQLAARDEYGRQINSYVQPYAVYELGCVLLGKPETVGKGRSLLLQAKEDFTGYDFENR
LHVRIHSALASLKEVVPQ

Cantalapiedra commented 2 years ago

Yes, thank you!

It seems to me that it is indeed reporting the whole chain of ancestor terms from the GO graph (as you previously suggested). For example, see https://www.ebi.ac.uk/QuickGO/term/GO:0032474

Hopefully, in future versions of the database, we can improve this and report only the most meaningful GO terms.

Best, Carlos

marchoeppner commented 2 years ago

Thanks for the quick feedback. Yes, it would certainly be desirable to filter the list of terms down to the relevant ones, since the rest of the graph is implicitly included. I would expect that this is what most people are looking for. Closing this for now and looking forward to a "fix".
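The filtering described here can be sketched as follows: drop every term that is an ancestor of another term in the same list, so that only the most specific terms remain. This is a sketch under stated assumptions, not eggNOG-mapper's behaviour. The function names are hypothetical, and the child-to-parents map uses made-up toy identifiers rather than real GO relationships.

```python
def ancestors(term, parents):
    """Collect every ancestor of `term` by walking the child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def most_specific(terms, parents):
    """Keep only terms that are not an ancestor of another term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return terms - implied


# Hypothetical toy graph (child -> direct parents); NOT real GO relationships.
toy_parents = {
    "T:leaf_a": {"T:mid"},
    "T:leaf_b": {"T:mid", "T:other"},
    "T:mid": {"T:root"},
    "T:other": {"T:root"},
}
annotated = {"T:root", "T:mid", "T:other", "T:leaf_a", "T:leaf_b"}
print(sorted(most_specific(annotated, toy_parents)))  # ['T:leaf_a', 'T:leaf_b']
```

With a real ontology, the parents map would have to come from a GO release file, and the choice of which relationship types count as "parent" changes the result.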

Cantalapiedra commented 2 years ago

Thank you @marchoeppner

marchoeppner commented 1 year ago

Reopening this as there has been no movement so far.

timase2021 commented 1 year ago

I am new to bioinformatics. I used eggnog to annotate a group of protein sequences and, like you, I get several GO terms for the same query. I checked on https://www.ebi.ac.uk/QuickGO/term/GO:0000122 and indeed the list sometimes follows the ancestry of a term, though not in full...

The results obtained with eggnog are not very well explained. How do you interpret them? With these GO results, how can I get a nice graph like those seen in publications?

Is there documentation that explains what each result column corresponds to? For example, does Description refer to the COG_category description?

query | seed_ortholog | evalue | score | eggNOG_OGs | max_annot_lvl | COG_category | Description | Preferred_name | GOs | EC | KEGG_ko | KEGG_Pathway | KEGG_Module | KEGG_Reaction | KEGG_rclass | BRITE | KEGG_TC | CAZy | BiGG_Reaction | PFAMs
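For working with these columns programmatically, a minimal sketch of reading the GOs column is shown below. It assumes a tab-separated table whose header line starts with `#query`, metadata lines that start with `##`, comma-separated GO IDs, and `-` for "no annotation". These layout details are assumptions that should be checked against your own output file. The function name and the three-column sample are illustrative only.

```python
import csv
import io


def read_go_column(handle):
    """Map each query to its list of GO IDs from an annotations table."""
    header = None
    result = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("##"):
            continue  # blank line or metadata comment
        if row[0].startswith("#"):
            header = [name.lstrip("#") for name in row]  # "#query" -> "query"
            continue
        if header is None:
            continue  # data before any header line: skip
        record = dict(zip(header, row))
        gos = record.get("GOs", "-")
        result[record["query"]] = [] if gos in ("", "-") else gos.split(",")
    return result


# Hypothetical, heavily reduced sample (assumed layout, three columns only).
sample = (
    "## hypothetical metadata line\n"
    "#query\tseed_ortholog\tGOs\n"
    "q1\t1.A\tGO:0008150,GO:0009987\n"
    "q2\t1.B\t-\n"
)
print(read_go_column(io.StringIO(sample)))
```

The resulting dictionary can be passed on to whatever tool is used for term filtering or plotting.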

marchoeppner commented 1 year ago

Hi, it seems like you are looking for the documentation of the output formats: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.8#user-content-Output_fields

Azurelan35 commented 9 months ago

Hey @marchoeppner @Cantalapiedra @timase2021, I wonder how you dealt with this GO redundancy issue in the end? I am currently struggling with the same problem. Thanks in advance for your reply. Lan

Cantalapiedra commented 8 months ago

From our side, we have not been able to implement a solution for this yet. You may need to rely on external tools that parse and process lists of GO terms. Sorry for the inconvenience.

Best, Carlos
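As one possible starting point for such external processing, the parent relationships can be read from an ontology file in OBO format. This is a minimal sketch with the standard library only, assuming `[Term]` stanzas with `id:`, `is_a:` and `relationship: part_of` lines. It ignores obsolete flags, alternative IDs and other relationship types, and the toy stanzas use made-up identifiers. Dedicated libraries such as goatools offer a more complete parser.

```python
def parents_from_obo(lines):
    """Build a child -> direct parents map from OBO-formatted lines."""
    parents = {}
    current = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[4:]
            parents.setdefault(current, set())
        elif in_term and current and line.startswith("is_a: "):
            # "is_a: <id> ! <name>" -> keep only the identifier
            parents[current].add(line[6:].split("!")[0].strip())
        elif in_term and current and line.startswith("relationship: part_of "):
            # "relationship: part_of <id> ! <name>" -> third token is the id
            parents[current].add(line.split()[2])
    return parents


# Toy OBO text with made-up identifiers; NOT taken from the real ontology.
toy_obo = """\
[Term]
id: T:root
name: root

[Term]
id: T:mid
is_a: T:root ! root

[Term]
id: T:leaf
is_a: T:mid ! mid
relationship: part_of T:root ! root

[Typedef]
id: part_of
"""
print(parents_from_obo(toy_obo.splitlines()))
```

Whether `part_of` should count as a parent relationship when collapsing a term list is a design choice, and it changes which terms are treated as redundant.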