cruizperez / MicrobeAnnotator

Pipeline for metabolic annotation of microbial genomes
Artistic License 2.0

Genes identified via Swissprot & Trembl fail to match to KO numbers #19

Open RVTrexler opened 3 years ago

RVTrexler commented 3 years ago

Hi,

I have annotated a set of MAGs using the standard database mode. Looking through the annotation results, I realized that none of the genes identified by either Swissprot or Trembl match KO numbers, even when EC numbers were identified for these genes. It seems possible that the KO numbers for (many or all of) these genes were identified via other databases (usually RefSeq). In all, KO numbers were only identified with KofamScan or RefSeq.

I'm not exactly sure if that is expected or not, but it seems strange to me that 2/4 of the reference databases would fail to find any KO matches (unless they're getting superseded by kofamscan & refseq?). Any insight/clarification about this would be appreciated. Attached is one of my annotation files for one of my MAGs as an example. I have found this to be the case for all of my MAGs.

S1.bin.6.fa.protein.translations.faa.annotations.xlsx

Thanks,

Ryan

cruizperez commented 3 years ago

Hi @RVTrexler, this is the usual behavior in MicrobeAnnotator. Most proteins that receive a KO identifier get it from the KOfam data, while a small minority are found using Swissprot, RefSeq, or Trembl. That is why, if the aim is to find KOs for your proteins, the --light mode is usually enough. However, if you were able to recover EC numbers, the --refine flag will convert those into KOs. You can try that to see if the annotations improve. Note, however, that EC numbers can be quite general and match many KOs, so I advise caution and manually checking your results. Let me know if you need additional help with this.
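
For the manual check mentioned above, a rough Python sketch along these lines may help. It tallies which database supplied each KO and separates EC-only hits into those whose EC number maps to a single KO and those that map to several and need review. This is not MicrobeAnnotator's implementation of --refine. The column names (`database`, `ko_number`, `ec_number`), the example rows, and the EC-to-KO lookup are all hypothetical placeholders; check them against your own annotation table and a real EC-to-KO mapping before relying on anything here.

```python
from collections import Counter

# Hypothetical annotation rows. The keys are assumed column names and the
# identifiers are placeholders, not real KO or EC numbers.
rows = [
    {"query_id": "gene_1", "database": "kofam", "ko_number": "K_example_1", "ec_number": ""},
    {"query_id": "gene_2", "database": "swissprot", "ko_number": "", "ec_number": "EC_example_A"},
    {"query_id": "gene_3", "database": "trembl", "ko_number": "", "ec_number": "EC_example_B"},
    {"query_id": "gene_4", "database": "refseq", "ko_number": "K_example_2", "ec_number": ""},
]

# Hypothetical EC -> KO lookup. In practice this would come from an external
# mapping; one EC number can correspond to several KOs.
ec_to_kos = {
    "EC_example_A": ["K_example_3"],
    "EC_example_B": ["K_example_4", "K_example_5", "K_example_6"],
}


def ko_counts_by_database(rows):
    """Count how many rows with a KO number came from each database."""
    counts = Counter()
    for row in rows:
        if row["ko_number"]:
            counts[row["database"]] += 1
    return counts


def classify_ec_only_rows(rows, ec_to_kos):
    """Split rows that have an EC number but no KO into two groups:
    those whose EC maps to exactly one KO, and those that map to several."""
    unambiguous, needs_review = {}, {}
    for row in rows:
        if row["ko_number"] or not row["ec_number"]:
            continue  # already has a KO, or has nothing to convert
        candidates = ec_to_kos.get(row["ec_number"], [])
        if len(candidates) == 1:
            unambiguous[row["query_id"]] = candidates[0]
        elif len(candidates) > 1:
            needs_review[row["query_id"]] = candidates
    return unambiguous, needs_review


print(dict(ko_counts_by_database(rows)))
unambiguous, needs_review = classify_ec_only_rows(rows, ec_to_kos)
print("single-KO matches:", unambiguous)
print("check manually:", needs_review)
```

Rows that land in the "check manually" group are the ones where a general EC number could stand for several different KOs, so those are the assignments worth inspecting by hand.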