apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Genetic codes #114

Closed Khalimat closed 3 months ago

Khalimat commented 4 months ago

Hi all,

Thank you for an amazing tool!

I use this tool a lot (viral metagenomics data) and I have noticed that the predicted proteins are translated with different genetic codes. For instance, I ran this command and saw three different genetic codes used for one sample:

    grep ">" [sample_name]/[sample_name]_annotate/[sample_name]_proteins.faa | grep -o "genetic_code=[1-9]*" | sort | uniq -c

    435370 genetic_code=11
     12682 genetic_code=15
      7762 genetic_code=4

I was expecting genetic code 11, but was surprised to see code 15 (the Blepharisma Nuclear Code) and code 4 (the Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code). Could you share how the tool chooses genetic codes?
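For anyone who would rather do the same tally in Python, here is a minimal sketch. It assumes only what the command above relies on, namely that each protein header contains a `genetic_code=N` field. The function name and the example headers are hypothetical and are not taken from geNomad's output.

```python
import re
from collections import Counter

# Matches the "genetic_code=N" field that the grep command above extracts.
GENETIC_CODE_RE = re.compile(r"genetic_code=(\d+)")


def count_genetic_codes(lines):
    """Tally genetic codes found in FASTA header lines (those starting with '>')."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        match = GENETIC_CODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical headers for illustration only; real headers carry more fields.
example_lines = [
    ">contig_1_1 genetic_code=11",
    "MKT",
    ">contig_1_2 genetic_code=11",
    "MSL",
    ">contig_2_1 genetic_code=15",
    "MAQ",
]
example_counts = count_genetic_codes(example_lines)
for code, n in sorted(example_counts.items()):
    print(code, n)
```

To run it on a real file, pass an open file handle instead of the list, for example `count_genetic_codes(open(path))`, where `path` points to the `_proteins.faa` file.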

apcamargo commented 4 months ago

Good question. I should include that in the documentation.

It's quite common for phages to use alternative genetic codes with stop codon reassignments[^1][^2]. If these are not considered during gene calling, predicted genes will often terminate prematurely, hindering annotation. When executed in metagenome mode, Prodigal (the gene caller most commonly used for phages) can identify genomes that use translation table 4, but not table 15. To close that gap, I forked Prodigal so that metagenome mode can also identify translation table 15 automatically. More recently, Martin Larralde incorporated my modifications into the pyrodigal-gv library, which geNomad has been using in the latest versions.

Now, regarding the automatic determination of alternative translation tables: In metagenome mode, Prodigal evaluates multiple pre-trained gene models and uses the "best" one for gene prediction in a given sequence. I leveraged this to enable prediction using translation table 15 by including additional models trained on phage genomes with TAG reassignment. When evaluating which model best fits a sequence with TAG reassignment, one of the newly trained models is likely to be selected, as they result in longer genes (translation table 11 would stop translation at TAG codons, which encode glutamine in these genomes). The same mechanism applies to translation table 4, but this table is also identified by the standard Prodigal, not just our fork.
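The "longer genes" effect can be illustrated with a toy sketch. This is not Prodigal's scoring, which evaluates full gene models and is not reproduced here; it only shows how the same reading frame runs further when TAG is not treated as a stop. The stop codon sets follow NCBI translation tables 11, 4, and 15. The function name and the sequence are made up for illustration.

```python
# Stop codons per NCBI translation table.
# Table 4 reassigns TGA (to tryptophan); table 15 reassigns TAG (to glutamine).
STOP_CODONS = {
    11: {"TAA", "TAG", "TGA"},
    4: {"TAA", "TAG"},
    15: {"TAA", "TGA"},
}


def orf_length_in_codons(seq, table):
    """Count codons from the start of seq up to, but not including, the first in-frame stop."""
    stops = STOP_CODONS[table]
    n_codons = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            break
        n_codons += 1
    return n_codons


# Made-up sequence: ATG GCT TAG GCT AAA TGG GCT TAA
toy_gene = "ATGGCTTAGGCTAAATGGGCTTAA"

lengths = {table: orf_length_in_codons(toy_gene, table) for table in STOP_CODONS}
for table, length in lengths.items():
    print(f"table {table}: {length} codons before the first stop")

# Picking the table that yields the longest ORF is a crude stand-in
# for "the model that fits best".
longest_table = max(lengths, key=lengths.get)
```

With tables 11 and 4 the toy gene ends at the in-frame TAG after two codons, while table 15 reads through it to the TAA at the end.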

You can read more about this in a recent paper we published, detailing how using a translation table 15-aware gene caller improves functional annotation.

How reliable are the predictions that genomes flagged with code 4 or 15 truly use these translation tables in nature? This is difficult to answer definitively. For longer sequences (e.g., longer than 10 kb), I generally trust the prediction, especially if a TAG is present within a protein's functional domain. It is possible that some genomes using translation table 11 in nature will be translated with table 15 by pyrodigal-gv, but this is more likely in short genome fragments where no genes using a TAG stop codon are present.
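One rough way to look for that kind of evidence is to check whether a gene called with table 15 actually contains an in-frame TAG before its final codon. The sketch below assumes you have the nucleotide sequence of the coding region, starting at the first codon. The function name is hypothetical, and this is a manual sanity check, not something geNomad or pyrodigal-gv does for you. Whether a given TAG falls inside a functional domain still has to be checked against the domain annotation.

```python
def internal_tag_codons(cds):
    """Return 0-based codon indices of in-frame TAG codons, excluding the final codon."""
    cds = cds.upper()
    last_codon_start = len(cds) - 3
    return [
        i // 3
        for i in range(0, last_codon_start, 3)
        if cds[i:i + 3] == "TAG"
    ]


# Made-up coding sequences for illustration.
recoded_cds = "ATGTAGGCTTAA"   # ATG TAG GCT TAA: TAG inside the gene
standard_cds = "ATGGCTTAG"     # ATG GCT TAG: TAG only as the final codon

print(internal_tag_codons(recoded_cds))
print(internal_tag_codons(standard_cds))
```

A contig where many table 15 genes have internal TAG codons is more convincing than a short fragment where none do.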

[^1]: Ivanova, Natalia N., et al. "Stop codon reassignments in the wild." Science 344.6186 (2014): 909-913.
[^2]: Borges, Adair L., et al. "Widespread stop-codon recoding in bacteriophages may regulate translation of lytic genes." Nature Microbiology 7.6 (2022): 918-927.

Khalimat commented 3 months ago

Thank you so much for the detailed explanation!