eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Output .gff file interpretation (and best practice for input file)? #369

Closed jade-davies closed 2 years ago

jade-davies commented 2 years ago

QI0013_eggnog_prokkafaa.gff.xlsx QI0013_eggnog_genome.gff.xlsx

Hello,

I've annotated a bacterial genome using the web interface, both directly from the genome sequence (in FASTA file format), and also from a protein annotation file (generated by prokka, in FAA file format).

I have downloaded the out.emapper.decorated.gff files for each one, and I'm having a little trouble interpreting them. I've attached the files. Could you please explain what the different fields in the annotation mean, such as em_target, em_desc, em_max_annot_lvl and em_PFAMs? All of this information seems to be placed within a single column.

Additionally, the annotation files seem quite different for the same genome, even though the only difference is the input file. Which input is better to use for eggNOG-mapper?

Thank you in advance,

Jade

Cantalapiedra commented 2 years ago

Hi @jade-davies ,

The main difference you will notice between your GFF files is that the one for the genome shows the positions of the features relative to the contigs, whereas the one for the proteins shows the positions relative to those proteins. Also, if you upload the genome, gene prediction is done from scratch, so the predicted features may differ from the proteins you already have from Prokka.

Regarding the different "em_" fields, these are just the same fields you should find in the ".emapper.seed_orthologs" and ".emapper.annotations" files. Please check https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.7#Output_format. Ask if you need further info on any field.
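As a rough illustration of how the "em_" fields could be pulled apart, here is a minimal Python sketch. It assumes the decorated GFF follows the usual GFF3 layout, with nine tab-separated columns and `key=value` pairs separated by `;` in the last one. The helper names and the example line are made up for illustration and are not taken from the attached files or from eggNOG-mapper itself.

```python
from urllib.parse import unquote


def parse_gff_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column into a {key: value} dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters; decode them if present.
        attrs[key.strip()] = unquote(value.strip())
    return attrs


def em_fields(gff_line: str) -> dict:
    """Return only the attributes whose key starts with 'em_'."""
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs = parse_gff_attributes(cols[8])
    return {k: v for k, v in attrs.items() if k.startswith("em_")}


# Hypothetical feature line; all values are placeholders, not real output.
line = "\t".join([
    "contig_1", "eggNOG-mapper", "CDS", "100", "400", ".", "+", "0",
    "ID=gene_1;em_target=example_target;"
    "em_desc=Example description;em_PFAMs=PFAM_A,PFAM_B",
])

for key, value in em_fields(line).items():
    print(key, "=", value)
```

Comment lines in a real file (those starting with `#`) would need to be skipped before calling `em_fields`.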

Regarding what input is better, in general I would say that if you already did the Prokka analysis you could continue using just those proteins.

I hope this is of help.

Best, Carlos

jade-davies commented 2 years ago

Hi @Cantalapiedra,

Thanks so much for your reply, it's really helpful! If I understand correctly, the STRING database is used for the protein IDs, so I can get gene ontology information directly from there?

Thanks again, Jade

Cantalapiedra commented 2 years ago

Hi @jade-davies ,

To be honest, I have never worked with the STRING database, so I am afraid I cannot help you with this. Some of the IDs may be shared, but I am not sure to what extent. Hopefully someone with more experience with it can be of more help.

Best, Carlos

Cantalapiedra commented 2 years ago

I will close this issue (since the original discussion seems finished).

Please re-open it or open a new issue if needed.

Best, Carlos