WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org

annovar annotates bacterial genome variation #258

Open ChenDepp opened 1 month ago

ChenDepp commented 1 month ago

Hi @d0ugal, I have a question about ANNOVAR. As we know, the gene structures of eukaryotes and prokaryotes are different, so how can ANNOVAR be used to annotate mutations identified in bacterial genomes? Have a good day. Thanks.

kaichop commented 1 month ago

instructions are here https://annovar.openbioinformatics.org/en/latest/user-guide/gene/#create-your-own-gene-definition-databases-for-non-human-species

ChenDepp commented 1 month ago

@kaichop thanks for your reply! I have built my own annotation database. Prokaryotes such as bacteria do not have exons, yet the annotation results contain exon annotation information. In addition, the codon table of prokaryotes differs from that of eukaryotes. Will this affect annotations such as synonymous and nonsense mutations?

kaichop commented 1 month ago

Hi @ChenDepp, the current ANNOVAR only includes codon tables for eukaryotes and mitochondria, so it does not cover every possible scenario. However, unless there is a known exception, bacteria use the same codon table as eukaryotes, except that the start codon AUG encodes formylmethionine rather than methionine. That makes no real difference to the annotation process, and it does not affect how synonymous and nonsense mutations are called.
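To illustrate why the shared codon table leaves these calls unchanged, here is a minimal Python sketch that classifies a single-codon change using the standard genetic code. It is not ANNOVAR's own code; the function name `classify_codon_change` and the label strings are assumptions made for illustration. As far as the amino acid assignments go, NCBI translation table 11 (bacterial) matches the standard table 1 and differs only in which codons may act as alternative starts, so the same lookup applies to both.

```python
from itertools import product

# Standard genetic code (NCBI table 1), written in TCAG order.
# Table 11 (bacterial) has the same amino acid assignments.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Label a codon substitution (hypothetical helper, not ANNOVAR's API)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous SNV"
    if alt_aa == "*":
        return "stopgain"  # nonsense mutation
    if ref_aa == "*":
        return "stoploss"
    return "nonsynonymous SNV"


print(classify_codon_change("CTG", "CTA"))  # Leu -> Leu
print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
```

Because the lookup depends only on codon-to-amino-acid assignments, a different start-codon interpretation does not change any of these labels.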

For the gene annotation database, you can treat each gene in the operon as a single-exon gene, so the annotation still works just fine (this is also why the output reports exon annotations for bacterial genes).
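As a rough sketch of the single-exon idea, the Python snippet below writes one gene as a basic 10-column UCSC genePred record in which the transcript, the coding region, and the only exon all share the same coordinates. The helper name `single_exon_genepred` and the example gene are hypothetical, and the exact columns ANNOVAR expects (for example a leading bin column or the extended genePred fields) are an assumption to verify against the user guide linked above.

```python
def single_exon_genepred(name, chrom, strand, start_1based, end_1based):
    """Format a gene as a single-exon genePred line (hypothetical helper).

    genePred uses 0-based, half-open coordinates, so the 1-based start
    is shifted down by one and the end is left as-is.
    """
    tx_start = start_1based - 1
    tx_end = end_1based
    fields = [
        name,
        chrom,
        strand,
        str(tx_start),   # txStart
        str(tx_end),     # txEnd
        str(tx_start),   # cdsStart: the whole gene is coding
        str(tx_end),     # cdsEnd
        "1",             # exonCount: one exon per gene
        f"{tx_start},",  # exonStarts (comma-terminated list)
        f"{tx_end},",    # exonEnds (comma-terminated list)
    ]
    return "\t".join(fields)


# Example with made-up coordinates for an illustrative gene.
print(single_exon_genepred("geneA", "chr1", "+", 101, 400))
```

Each gene in an operon would get its own line like this, built from its known start and end positions.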

There is a more complex issue with SARS-CoV-2, where you can say that all peptides come from the same ORF1ab gene, or you can say that there is one single large gene/protein that is processed into multiple peptides such as ORF2, ORF4, etc. ANNOVAR handles both scenarios. Bacteria typically do not have this type of complication, because each gene in an operon works independently as a single protein product. As long as you know the start and end positions, you can treat it as a gene and annotate it.