althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
132 stars 5 forks source link

Feature request: Make assumptions from ambiguous bases in translation #54

Closed schorlton-bugseq closed 2 months ago

schorlton-bugseq commented 2 months ago

Thanks again for the great tool! I noticed some differences in CDSs when annotating/calling with prokka vs p(y)rodigal. As I was a bit surprised, I investigated further.

For pyrodigal/prodigal, they convert GGN to X (when using genetic code 11). Prokka uses bioperl (I think) on the FNA and converts these bases to G. Biopython takes a similar approach:

>>> from Bio.Seq import Seq
>>> dna_sequence = Seq("GGN")
>>> dna_sequence.translate(table=11)
Seq('G')

I think the Bioperl/Prokka/Biopython approach is better because regardless of what base that N is, using code 11, the translation is G. I understand this deviates from prodigal and so even having this as an option would be super appreciated. As an interim solution, I recognize I could extract the nucleic sequence based on coordinates and translate with biopython. Thanks for your consideration!

oschwengers commented 2 months ago

Interesting!

Also:

>>> from Bio.Seq import Seq
>>> dna_sequence = Seq("CCN")
>>> dna_sequence.translate(table=11)
Seq('P')

So, although there is are arbitrary nucleotides at the 3rd position (N), BioPython seems to convert both CCN and GGN to P and G, respectively, since these triplets do not depend on the 3rd nucleotide.

It would be interesting to see if Pyrodigal also translates CCN to X ...

althonos commented 2 months ago

So that's actually something that I sticked to for compatibility with prodigal: the code for translating amino-acid in Prodigal supports ambiguous bases:

char amino(unsigned char *seq, int n, struct _training *tinf, int is_init) {
    /* ... */
    if(is_c(seq, n) == 1 && is_g(seq, n+1) == 1) return 'R';
    /* ... */
}

But the code for writing translations to FASTA does actually mask any codon with one or more ambiguous base:

void write_translations(FILE *fh, struct _gene *genes, int ng, struct 
                        _node *nod, unsigned char *seq, unsigned char *rseq, 
                        unsigned char *useq, int slen, struct _training *tinf,
                        int sctr, char *short_hdr) {
    /* ... */
    for(j = genes[i].begin; j < genes[i].end; j+=3) {
        if(is_n(useq, j-1) == 1 || is_n(useq, j) == 1 || is_n(useq, j+1) == 1) 
          fprintf(fh, "X");
        else fprintf(fh, "%c", amino(seq, j-1, tinf, (j==genes[i].begin?1:0) &&
                     (1-nod[genes[i].start_ndx].edge)));
        if((j-genes[i].begin)%180 == 177) fprintf(fh, "\n");
      }
    /* ... */
}

I can add support for ambiguous bases relatively easily with an extra flag (for backwards compatibility), if that sounds good.

schorlton-bugseq commented 2 months ago

That would be awesome and much appreciated. One other suggestion (sorry if this should be a separate thread): the user should also be able to set translation code for Genbank output. Right now it appears fixed at 11 and therefore it is possible to get different outputs from FASTA and Genbank outputs. Right now, my workflow is:

  1. Call CDS with pyrodigal
  2. Write to Genbank
  3. Read in Genbank, re-translate every CDS with biopython
  4. Write updated Genbank and FASTA (with updated translation and headers as per https://github.com/althonos/pyrodigal/issues/53)

Probably a couple changes would enable skipping that reading and re-translation. Thanks again for your hard work on the tool!

althonos commented 2 months ago

Right now it appears fixed at 11 and therefore it is possible to get different outputs from FASTA and Genbank outputs.

It is actually fixed at whatever translation table is used in the training info; either whatever was provided as the training_info argument to GeneFinder.train, or whatever translation table is associated with the metagenomic bin while in metagenomic mode. This means that unless you change the translation table, you won't get inconsistencies. If you do want to change the translation table, you should do that when training / before running gene finding.

schorlton-bugseq commented 2 months ago

Got it - that makes a lot of sense! Sorry for the confusion. Is it possible that meta mode is used to call CDS on an organism with a different genetic code and a user would want to translate with that code? The (admittedly extremely rare) case I'm thinking is you know you have 10kb of Mycoplasma and use meta mode because it's too short for single mode. It appears to me that Mycoplasma species are included in meta mode (https://github.com/hyattpd/Prodigal/blob/c1e2d361479cc1b18175ea79ebd8ff10411c46cb/metagenomic.c#L79-L85)?

althonos commented 2 months ago

The (admittedly extremely rare) case I'm thinking is you know you have 10kb of Mycoplasma and use meta mode because it's too short for single mode. It appears to me that Mycoplasma species are included in meta mode

I suppose in that case that the Mycoplasma model will have the highest score, so it will indeed use the Mycoplasma translation table! And if you know you always will process Mycoplasma data you can train your own model and just use that one for all your contigs (and you could even release them, like in pyrodigal-gv with the giant viruses)!

schorlton-bugseq commented 2 months ago

Amazing - thanks again!

althonos commented 2 months ago

I refactored the translation code so that an ambiguous second or third base is handled correctly for each translation table: this needs to be enabled with a flag (for backward compatibility):

gene.translate(strict=False)

I also added new translation tables (up to 33) although not all of them are supported in GeneFinder.train (27, 28 and 31 cannot be used because of the context-dependent STOP codon which conflicts with Prodigal's method).

althonos commented 2 months ago

Now available in v3.4.0 :+1: