eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Why am I getting more CDS with KO numbers when using the tax scope "bacteria" than without it? #469

Open Julboteroc opened 1 year ago

Julboteroc commented 1 year ago
Dear developers,

I hope this message finds you well.

I ran eggNOG-mapper on my bacterial genomes twice: once with the "--tax_scope bacteria" parameter and once without it. I obtained more CDS with KO numbers when using "--tax_scope bacteria". I would like to understand why this is the case, considering that I am restricting the database to the bacterial domain.

Furthermore, when analyzing the completeness of KEGG modules, I also observed a higher number of complete modules when using the "--tax_scope bacteria" parameter.

Could you kindly explain the reason behind these observations?

Thank you in advance for your assistance.

| Genome | CDS annotated (with tax scope bacteria) | CDS annotated (without tax scope) | CDS with KO numbers (with tax scope bacteria) | CDS with KO numbers (without tax scope) |
| -- | -- | -- | -- | -- |
| FA2 | 1223 | 1228 | 949 | 877 |
| FAT | 1118 | 1122 | 840 | 778 |
| FB | 1131 | 1133 | 860 | 791 |
| TY | 1447 | 1451 | 1047 | 960 |

Julboteroc commented 1 year ago

I apologize for not mentioning earlier that when using the "--tax_scope bacteria" parameter, I obtained a lower number of annotated CDS (coding sequences). However, among the annotated CDS, a higher number had KO numbers, as shown in the table.
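To make the pattern in the table explicit, here is a small sketch that computes, for each genome, the change in annotated CDS, the change in CDS with KO numbers, and the fraction of annotated CDS that carry a KO number. The counts are copied from the table above; the function and key names are only illustrative.

```python
# Counts copied from the table in this issue, per genome:
# (annotated with scope, annotated without scope, KO with scope, KO without scope)
counts = {
    "FA2": (1223, 1228, 949, 877),
    "FAT": (1118, 1122, 840, 778),
    "FB": (1131, 1133, 860, 791),
    "TY": (1447, 1451, 1047, 960),
}


def summarize(counts):
    """Return per-genome differences and KO fractions (scoped run vs default run)."""
    summary = {}
    for genome, (ann_scope, ann_default, ko_scope, ko_default) in counts.items():
        summary[genome] = {
            "annotated_diff": ann_scope - ann_default,
            "ko_diff": ko_scope - ko_default,
            "ko_fraction_scope": ko_scope / ann_scope,
            "ko_fraction_default": ko_default / ann_default,
        }
    return summary


summary = summarize(counts)
for genome, row in summary.items():
    print(
        f"{genome}: annotated {row['annotated_diff']:+d}, KO {row['ko_diff']:+d}, "
        f"KO fraction {row['ko_fraction_scope']:.1%} vs {row['ko_fraction_default']:.1%}"
    )
```

With the scope, each genome loses 2 to 5 annotated CDS but gains 62 to 87 CDS with KO numbers, so the effect is on which annotation terms get reported rather than on how many sequences find a seed ortholog.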

Cantalapiedra commented 1 year ago

Hi @Julboteroc ,

It is possible that when restricting the tax scope you are getting more annotation terms which exceed the thresholds required to be reported. Just my 2 cents. But to be sure in your case, we may need to check specifically what is happening with your input sequences.

Best, Carlos

Julboteroc commented 1 year ago

Dear @Cantalapiedra,

Thank you very much for your response. In my case, I used the following argument:

emapper.py -i FA2.faa -o FA2 --output_dir output-eggnog/ -m diamond --cpu 8 --report_orthologs --dbmem --override --tax_scope Bacteria

I understand that by specifying "--tax_scope Bacteria" (with uppercase), I might be losing speciation events. Therefore, the max_annot_lvl column of my results only shows "Bacteria". However, would it also be advisable to use "--tax_scope bacteria" (with lowercase)?

When I used eggnog_mapper without specifying the tax scope as mentioned below:

emapper.py -i FA2.faa -o FA2 --output_dir output-eggnog/ -m diamond --cpu 8 --report_orthologs --dbmem --override

I noticed that many genes were not annotated. So, according to your comment, is it not recommended to use "--tax_scope Bacteria", as it exceeds the required limit and generates false positives?

Thank you very much for developing this amazing tool.

Best regards, Juliana

Cantalapiedra commented 10 months ago

Hi @Julboteroc ,

Sorry for the veeeery late response.

--tax_scope Bacteria and --tax_scope bacteria are very different in theory, but I have not checked how different they are in practice. Note that --tax_scope, as it is defined here https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.12#annotation-options , will be intersected with the tax groups of the seed ortholog OG. If I am not mistaken, if you specify "Bacteria", you will get annotations only when "Bacteria" is among the tax groups of your seed ortholog. However, if you use 'bacteria', this is not a "LIST_OF_TAXA" (see the link above), but a "PREDEFINED_FILENAME" that we created (see the file eggnogmapper/annotation/tax_scopes/bacteria in your eggnog-mapper directory). In that case, the intersection of the list in that file with the list of taxa from the seed ortholog will be used.
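The intersection described above can be pictured with a minimal sketch. This is not eggnog-mapper's actual code: the function name, the tax group names, and the contents of the "predefined" list are all hypothetical and only illustrate how a single-taxon scope and a multi-taxon scope can select different levels from the same seed ortholog.

```python
def effective_scope(seed_og_levels, tax_scope):
    """Return the seed ortholog's tax groups that also appear in tax_scope.

    The order of seed_og_levels is preserved. Hypothetical helper, not part
    of eggnog-mapper.
    """
    allowed = set(tax_scope)
    return [level for level in seed_og_levels if level in allowed]


# Hypothetical tax groups of the OGs that a seed ortholog belongs to.
seed_og_levels = ["root", "Bacteria", "Proteobacteria", "Gammaproteobacteria"]

# A LIST_OF_TAXA-style scope containing a single taxon.
single_taxon_scope = ["Bacteria"]

# Hypothetical stand-in for a PREDEFINED_FILENAME scope; the real file
# contents are not reproduced here.
predefined_scope = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Firmicutes"]

print(effective_scope(seed_og_levels, single_taxon_scope))
print(effective_scope(seed_og_levels, predefined_scope))
```

Under these assumptions the single-taxon scope leaves only the broad "Bacteria" level available, while the multi-taxon list also keeps the narrower levels, which is consistent with the expectation that the two settings can yield different annotations.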

I guess that if you use Bacteria, you will get broader trees from which to annotate, whereas if you use bacteria you should get more specific annotations.

The default --tax_scope ('auto' or 'all'; see eggnogmapper/annotation/tax_scopes/all) includes Bacteria and many other groups (like Archaea or Eukaryota). 'all' includes 108 groups and 'bacteria' only 58. All the groups from 'bacteria' are also included in 'all'.
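If you want to verify the group counts and the subset relationship on your own installation, a sketch along these lines could work. It assumes each tax scope file holds one entry per line and that blank lines or lines starting with "#" can be skipped; that file layout is an assumption, so check the files before relying on it. The helper names are hypothetical.

```python
from pathlib import Path


def load_scope(path):
    """Read a tax scope file into a set, assuming one entry per line."""
    entries = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and (assumed) comment lines.
        if line and not line.startswith("#"):
            entries.add(line)
    return entries


def compare_scopes(narrow_path, wide_path):
    """Report the sizes of two scopes and any narrow entries absent from the wide one."""
    narrow = load_scope(narrow_path)
    wide = load_scope(wide_path)
    return {
        "narrow_size": len(narrow),
        "wide_size": len(wide),
        "missing_from_wide": sorted(narrow - wide),
    }


# Example call, with paths relative to the eggnog-mapper directory:
# compare_scopes("eggnogmapper/annotation/tax_scopes/bacteria",
#                "eggnogmapper/annotation/tax_scopes/all")
```

An empty "missing_from_wide" list would confirm that every group in the narrower scope is also present in the wider one.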

Regarding the number of annotated CDS and the number of CDS with KO terms, as I said, it is difficult for me to tell without trying your data and seeing for myself what is actually happening. In principle, I would expect more annotated CDS and more KO terms using the default --tax_scope than with 'Bacteria' or 'bacteria', because the default one is wider. However, there are other rules when retrieving annotations, like the orthology relationships of the seed ortholog with the other members of the OG being used to retrieve annotations.
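One way to see what is happening for specific sequences is to compare the annotation tables of the two runs and list the queries that have a KO term only in the scoped run. The sketch below assumes the usual tab-separated annotations output, with "##" comment lines, a header line starting with "#query", a "KEGG_ko" column, and "-" for empty values; please check these assumptions against your own files. The function names are hypothetical.

```python
import csv


def read_ko_by_query(path, ko_column="KEGG_ko"):
    """Map each query to its set of KO terms from an annotations table."""
    ko_by_query = {}
    header = None
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row or row[0].startswith("##"):
                continue  # blank line or comment line
            if row[0].startswith("#"):
                # Header line such as "#query<TAB>seed_ortholog<TAB>..."
                header = [name.lstrip("#") for name in row]
                continue
            if header is None:
                raise ValueError("no header line found before data rows")
            record = dict(zip(header, row))
            value = record.get(ko_column, "-")
            ko_by_query[record["query"]] = {
                ko for ko in value.split(",") if ko and ko != "-"
            }
    return ko_by_query


def ko_gained_with_scope(default_path, scoped_path):
    """List queries that have KO terms in the scoped run but none in the default run."""
    default = read_ko_by_query(default_path)
    scoped = read_ko_by_query(scoped_path)
    return sorted(q for q, kos in scoped.items() if kos and not default.get(q))
```

Inspecting the seed ortholog, eggNOG OGs, and max_annot_lvl columns for the queries returned by `ko_gained_with_scope` should show which annotation level each run used for them.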

I hope this is of help.

Best, Carlos