Closed schorlton-bugseq closed 2 months ago
Interesting!
Also:
>>> from Bio.Seq import Seq
>>> dna_sequence = Seq("CCN")
>>> dna_sequence.translate(table=11)
Seq('P')
So, although there is are arbitrary nucleotides at the 3rd position (N
), BioPython seems to convert both CCN
and GGN
to P
and G
, respectively, since these triplets do not depend on the 3rd nucleotide.
It would be interesting to see if Pyrodigal also translates CCN
to X
...
So that's actually something that I sticked to for compatibility with prodigal
: the code for translating amino-acid in Prodigal supports ambiguous bases:
char amino(unsigned char *seq, int n, struct _training *tinf, int is_init) {
/* ... */
if(is_c(seq, n) == 1 && is_g(seq, n+1) == 1) return 'R';
/* ... */
}
But the code for writing translations to FASTA does actually mask any codon with one or more ambiguous base:
void write_translations(FILE *fh, struct _gene *genes, int ng, struct
_node *nod, unsigned char *seq, unsigned char *rseq,
unsigned char *useq, int slen, struct _training *tinf,
int sctr, char *short_hdr) {
/* ... */
for(j = genes[i].begin; j < genes[i].end; j+=3) {
if(is_n(useq, j-1) == 1 || is_n(useq, j) == 1 || is_n(useq, j+1) == 1)
fprintf(fh, "X");
else fprintf(fh, "%c", amino(seq, j-1, tinf, (j==genes[i].begin?1:0) &&
(1-nod[genes[i].start_ndx].edge)));
if((j-genes[i].begin)%180 == 177) fprintf(fh, "\n");
}
/* ... */
}
I can add support for ambiguous bases relatively easily with an extra flag (for backwards compatibility), if that sounds good.
That would be awesome and much appreciated. One other suggestion (sorry if this should be a separate thread): the user should also be able to set translation code for Genbank output. Right now it appears fixed at 11 and therefore it is possible to get different outputs from FASTA and Genbank outputs. Right now, my workflow is:
Probably a couple changes would enable skipping that reading and re-translation. Thanks again for your hard work on the tool!
Right now it appears fixed at 11 and therefore it is possible to get different outputs from FASTA and Genbank outputs.
It is actually fixed at whatever translation table is used in the training info; either whatever was provided as the training_info
argument to GeneFinder.train
, or whatever translation table is associated with the metagenomic bin while in metagenomic mode. This means that unless you change the translation table, you won't get inconsistencies. If you do want to change the translation table, you should do that when training / before running gene finding.
Got it - that makes a lot of sense! Sorry for the confusion. Is it possible that meta mode is used to call CDS on an organism with a different genetic code and a user would want to translate with that code? The (admittedly extremely rare) case I'm thinking is you know you have 10kb of Mycoplasma and use meta mode because it's too short for single mode. It appears to me that Mycoplasma species are included in meta mode (https://github.com/hyattpd/Prodigal/blob/c1e2d361479cc1b18175ea79ebd8ff10411c46cb/metagenomic.c#L79-L85)?
The (admittedly extremely rare) case I'm thinking is you know you have 10kb of Mycoplasma and use meta mode because it's too short for single mode. It appears to me that Mycoplasma species are included in meta mode
I suppose in that case that the Mycoplasma model will have the highest score, so it will indeed use the Mycoplasma translation table! And if you know you always will process Mycoplasma data you can train your own model and just use that one for all your contigs (and you could even release them, like in pyrodigal-gv
with the giant viruses)!
Amazing - thanks again!
I refactored the translation code so that an ambiguous second or third base is handled correctly for each translation table: this needs to be enabled with a flag (for backward compatibility):
gene.translate(strict=False)
I also added new translation tables (up to 33) although not all of them are supported in GeneFinder.train
(27, 28 and 31 cannot be used because of the context-dependent STOP codon which conflicts with Prodigal's method).
Now available in v3.4.0
:+1:
Thanks again for the great tool! I noticed some differences in CDSs when annotating/calling with prokka vs p(y)rodigal. As I was a bit surprised, I investigated further.
For pyrodigal/prodigal, they convert
GGN
toX
(when using genetic code 11). Prokka uses bioperl (I think) on the FNA and converts these bases toG
. Biopython takes a similar approach:I think the Bioperl/Prokka/Biopython approach is better because regardless of what base that
N
is, using code 11, the translation isG
. I understand this deviates from prodigal and so even having this as an option would be super appreciated. As an interim solution, I recognize I could extract the nucleic sequence based on coordinates and translate with biopython. Thanks for your consideration!