Source of KEGG, BiGG, and GO terms in emapper output?

taltman commented 4 years ago

First off, thank you @jhcepas for your excellent work on the EggNOG project. I appreciate the fine work that you have put into this.

I am trying to get a better understanding of the source of the controlled vocabulary annotations provided in the EggNOG mapper 2.0 output, aside from the COG category.

For example, the following Wiki page describes annotation columns 5 & 7-18, but there is no explanation for how they are derived.

https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2

Looking at the last few EggNOG papers, I only see descriptions about how the COG Functional Category (column 21) and the free text description (column 22) are derived.

Of course, it is easy to imagine a naive way of doing this: bring in the annotations of the sequences within an OG at a given taxonomic level. But that raises the following questions:

1. How does EggNOG resolve conflicting controlled vocabulary terms? Does it just take the union?
2. How did the EggNOG team validate that these controlled vocabulary terms are assigned correctly using EggNOG Mapper 2.0?

Any clarity that you can provide regarding my above questions would be greatly appreciated! Thanks in advance!

taltman commented 4 years ago

I hope this finds the devs doing well. Any help on this question would be greatly appreciated!

Cantalapiedra commented 4 years ago

Hi,

Regarding question 1, if I understood correctly, eggNOG-mapper takes the union of the GO terms from the identified orthologs.
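To make the "union" idea concrete, here is a minimal sketch, not eggNOG-mapper's actual code. The ortholog names, the GO terms, and both function names are made up for illustration. It contrasts a plain union with a stricter majority vote, which is one possible alternative way to handle terms that only some orthologs carry.

```python
from collections import Counter

# Hypothetical toy data: GO terms already attached to each ortholog of a query.
ortholog_terms = {
    "orthologA": {"GO:0003677", "GO:0006355"},
    "orthologB": {"GO:0003677", "GO:0005634"},
    "orthologC": {"GO:0003677"},
}


def union_transfer(terms_by_ortholog):
    """Transfer every term seen on at least one ortholog."""
    return set().union(*terms_by_ortholog.values())


def majority_transfer(terms_by_ortholog, min_fraction=0.5):
    """Transfer only terms carried by more than min_fraction of the orthologs."""
    counts = Counter(t for terms in terms_by_ortholog.values() for t in terms)
    cutoff = min_fraction * len(terms_by_ortholog)
    return {term for term, n in counts.items() if n > cutoff}


print(sorted(union_transfer(ortholog_terms)))
print(sorted(majority_transfer(ortholog_terms)))
```

With this toy input the union keeps all three terms, while the majority vote keeps only the term shared by every ortholog. The union maximizes coverage at the cost of passing along any term that a single ortholog happens to carry.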

I am not sure about question 2. Do you mean eggNOG annotation of GO terms, or eggNOG-mapper annotation of a query from those eggNOG annotations?

For example, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238944/

"An important feature of eggNOG is that it provides functional annotations for the orthologous groups. These annotations are produced by a pipeline, which summarizes the available functional information on the proteins in each cluster: (1) the textual annotation for these proteins, (2) their annotated Gene Ontology (GO) terms (14), (3) their membership to KEGG pathways (15) and (4) the presence of protein domains from SMART (16) and Pfam (17)."

taltman commented 4 years ago

Hi @Cantalapiedra , my apologies for the delayed response.

I appreciate the citation. I did look at it briefly before, and again just now. It states that KEGG & GO terms are gathered using their pipeline, but it doesn't state 1) how the pipeline works, or 2) the upstream source of the functional annotations. Raw protein sequences don't have annotations; they must come from somewhere. :-) So, as I asked above, what is the source of the functional annotations displayed in columns 5 & 7-18 of the EggNOG-Mapper output?

I'm also concerned about how reliable these annotations are. Whether they are based on computational predictions (as in TrEMBL) or on human curation (as in SwissProt) makes a big difference in the trustworthiness of the results. So, two questions:

1. What is the nature of the source of each of these types (each column) of annotation? Computational predictions or manual curation? Or both?
2. What validation has been performed to see how accurate EggNOG-Mapper is in correctly predicting these non-GO annotations?

Thanks in advance!

Cantalapiedra commented 4 years ago

Hi @taltman ,

Regarding the sources of annotations, I would say that they are based on both computational predictions and manual curation, depending on the specific database used for annotation (KEGG, GO, etc.) and on the source of the proteins, which are already annotated there (Ensembl, RefSeq, etc.).

You can find references here: https://academic.oup.com/nar/article/47/D1/D309/5173662. There are additional details here: https://academic.oup.com/nar/article/44/D1/D286/2503059 and here: http://eggnog5.embl.de/#/app/methods

Regarding eggNOG-mapper, the main validation you can find is the CAFA2 benchmark here: https://pubmed.ncbi.nlm.nih.gov/28460117/. You can also find a benchmark for metagenomics data there.

We have also done some tests with Pfam domains. In general, the wider the tax_scope, the more annotations you get, but likely also the more false positives.
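As a rough illustration of that trade-off, the sketch below scores two sets of predicted domains against a curated reference set using precision and recall. This is only a toy example under stated assumptions: the Pfam accessions, the "narrow" and "wide" prediction sets, and the `precision_recall` helper are all hypothetical, and this is much simpler than the scoring used in CAFA2.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) of a predicted term set against a reference set."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical curated domains for one protein.
reference = {"PF00069", "PF07714"}

# Hypothetical predictions: a narrow scope transfers fewer domains,
# a wide scope transfers more, including some not in the reference.
narrow_scope = {"PF00069"}
wide_scope = {"PF00069", "PF07714", "PF00017", "PF00018"}

print("narrow:", precision_recall(narrow_scope, reference))
print("wide:  ", precision_recall(wide_scope, reference))
```

In this made-up case the narrow scope gives higher precision but lower recall, and the wide scope gives the reverse, which matches the direction of the trend described above.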

I hope this helps.

Best, Carlos

taltman commented 3 years ago

Hi Carlos,

Thanks for your reply. So, with regard to my original questions, I take from your reply that:

1. All of the annotations of NOGs in EggNOG are computational predictions, and have not been manually reviewed for errors.
2. No validation has been performed to verify that these non-GO term annotations are accurate.

If I have misunderstood your references, please let me know.

Cantalapiedra commented 3 years ago

Hi @taltman ,

I hope another reference will help with your previous 2 questions:

"Nevertheless, the incompleteness of the COG membership and the absence of up-to-date COG annotations have become major impediments to the use of this system in comparative genomics. A major extension of the COGs is implemented in the EggNOG database, with an increased number of genomes included and new clusters of orthologs (denoted NOGs, after Non-supervised Orthologous Groups); however, EggNOG is completely automatic, without manual supervision of the cluster membership or annotation (21)."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383993/

eggnogdb / eggnog-mapper

Source of KEGG, BiGG, and GO terms in emapper output? #216