eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Missing COG #424

Open sowptika opened 1 year ago

sowptika commented 1 year ago

Hi, I recently used the eggnog-mapper website to annotate my bacterial genome for COG analysis. From the output, I used the COG_categories column to count the genes annotated under each COG category and plot the result. But I realized that categories R (General function prediction only), W (Extracellular structures) and X (Mobilome: prophages, transposons) are missing. Is this possible? Also, does "Transposase" fall under the 'L' COG category?

Please help me out regarding my doubts.

Looking forward to your response and thank you in advance.

Sincerely, Sowptika Pal
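
The per-category tally described in this question can be sketched in Python as follows. This is a minimal sketch, not the eggnog-mapper authors' method. It assumes a tab-separated annotations table whose header line starts with `#query`, whose comment lines start with `##`, and whose category column is named `COG_category`; pass `column="COG_categories"` if your file uses that name. It also assumes `-` means "no category" and that multi-letter values such as `KT` assign the gene to each letter. The function name and the example rows are hypothetical.

```python
from collections import Counter


def count_cog_categories(lines, column="COG_category"):
    """Count genes per COG category letter in an annotations table.

    `lines` is any iterable of text lines (an open file works too).
    """
    counts = Counter()
    col_idx = None
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and "##" comment lines (assumed convention).
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if col_idx is None:
            # First remaining line is treated as the header, e.g. "#query\t...".
            header = [name.lstrip("#") for name in fields]
            col_idx = header.index(column)
            continue
        if len(fields) <= col_idx:
            continue  # malformed or truncated row
        value = fields[col_idx]
        if value in ("", "-"):
            continue  # no category assigned
        # A value like "KT" counts once for K and once for T.
        counts.update(ch for ch in value if ch.isalpha())
    return counts


# Hypothetical rows, for illustration only.
example = [
    "## hypothetical run header",
    "#query\tseed_ortholog\tCOG_category\tDescription",
    "gene1\tseedA\tL\tTransposase",
    "gene2\tseedB\tKT\tTranscriptional regulator",
    "gene3\tseedC\t-\t-",
]
counts = count_cog_categories(example)
for category, n in sorted(counts.items()):
    print(category, n)
```

A `Counter` returns 0 for letters that never occur, so categories absent from the output (R, W or X here) can be plotted as explicit zeros instead of being silently dropped.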

Cantalapiedra commented 1 year ago

Hi @sowptika ,

I checked and there are OGs with R and W categories. However, there aren't OGs with X category. Regarding the L category, I see that there are OGs labeled as nucleases, endonucleases, helicases and transposases, among others.

I hope this is of help.

Best, Carlos

sowptika commented 1 year ago

Dear Carlos, thank you for the information. But my data does not show the R category. Is it possible that the R category is not present in a bacterial genome? If not, what could be the reason that R is not represented?

Best wishes,

Cantalapiedra commented 1 year ago

From what I see in the DB, the "R" category is present in only 3 OGs, all of them described as Hsp70, and all of them from ssRNAs or Viruses.

4QE5G|10239|5|Hsp70 protein|R
4R0RP|35278|3|Hsp70 protein|R
4R125|439488|3|Hsp70 protein|R

The fields are OG, tax ID, size of OG (number of proteins), OG description, and COG category.
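
To work with records in this layout, a small parser along the following lines may help. It is only a sketch of the five-field, pipe-delimited format described above, not the layout of any particular eggNOG file. The `OGRecord` type and `parse_og_record` function are hypothetical names, and the sketch assumes the description never contains a `|` character.

```python
from typing import NamedTuple


class OGRecord(NamedTuple):
    og: str            # orthologous group identifier
    tax_id: int        # taxonomic ID of the OG's level
    size: int          # number of proteins in the OG
    description: str   # free-text OG description
    cog_category: str  # COG category letter(s)


def parse_og_record(line):
    """Parse one 'OG|tax ID|size|description|category' record."""
    og, tax_id, size, description, category = line.strip().split("|")
    return OGRecord(og, int(tax_id), int(size), description, category)


# The three records quoted in this comment.
records = [
    parse_og_record(line)
    for line in (
        "4QE5G|10239|5|Hsp70 protein|R",
        "4R0RP|35278|3|Hsp70 protein|R",
        "4R125|439488|3|Hsp70 protein|R",
    )
]
total_proteins = sum(r.size for r in records)
print(len(records), "OGs in category R covering", total_proteins, "proteins")
```

Summing the size field shows how few proteins the R category covers in these three OGs, which is consistent with it not appearing in a bacterial annotation.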

sowptika commented 1 year ago

Dear Carlos, I checked the R category representation in NCBI and under this category there are multiple annotations. Does the EggNOG database follow a different list as compared to NCBI COG database?

Best Wishes

Cantalapiedra commented 1 year ago

Dear Sowptika,

Unfortunately, I don't know the details regarding how COG categories were assigned to eggNOG 5 OGs.

From https://doi.org/10.1093/nar/gkv1248

"The individual assignments are made by a Support Vector Machine (SVM) classifier trained on proteins within COGs, KOGs and arKOGs, using as features text description words and substrings, protein domain and Gene Ontology term assignments, as well as KEGG pathway membership information."

and from http://eggnog5.embl.de/#/app/methods

"The functional categories introduced in COG (2), KOG (16) and arCOG (44) are employed. This is a controlled vocabulary of 20 functional categories to which the orthologous groups of those databases are assigned, and similarly, non-supervised orthologous groups (NOGs) are assigned to these categories using support vector machine classification applying available annotation [i.e. free text data, KEGG (45) pathway or module membership, SMART (46) or Pfam (47) domain content and Gene Ontology (48) annotations] as a feature space."

There are a few more technical details in the previous link.

So, I guess that yes, eggNOG annotations can differ from the ones found in NCBI, but to be honest I don't know to what extent. Sorry for not being of more help on this.

Best, Carlos

sowptika commented 1 year ago

Dear Carlos, Thank you for this detailed reply. I realize that eggNOG works differently from other COG annotators, hence the difference. As a suggestion from a user's point of view, I think COG annotations should be the same on any platform, so if possible these differences could be removed to give homogeneous data and make future results comparable. Thank you once again for all your time and help.

Best Wishes, Sowptika

Cantalapiedra commented 1 year ago

Dear Sowptika,

I think it is a very valid suggestion. We will try to improve this in future releases.

However, as of now, I can't criticize the previous authors who did the annotations of OGs, since I have not faced the problem myself.

Note also that these annotations will very likely never be exactly the same as in other databases, since you are getting a COG category for a whole OG (actually a non-supervised OG or NOG) and not for single proteins or for a supervised OG (COG, KOG, etc).

That being said, I understand that in the cases you described in this issue you are seeing very large differences, and thus this may be improved.

Thank you for your patience, and the suggestions, and sorry for any inconvenience.

Best, Carlos