althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
138 stars 5 forks source link

Imperfect Replication of Prodigal #14

Closed KGerhardt closed 2 years ago

KGerhardt commented 2 years ago

Hello again, Martin,

I've tried running your pyrodigal CLI against the original prodigal and noticed that there's some differences in the output. Most of the genes are identical or nearly identical, but differences do exist. I suspect that there's a difference in the training sequences.

I also created a wrapper around pyrodigal for my own purposes, and I train on a concatenation of the sequences in an input file, then run gene prediction on that basis. It looks to me like your code trains on each sequence, and then predicts genes on the per-sequence training model. My approach produces a result which is slightly closer to the original, but is also not identical to prodigal in all cases.

I've been trying to figure out exactly how prodigal selects its sequence for training - it doesn't look like it does either of our approaches. In the case of the attached genome, prodigal claims it has a training sequence 9511467 bp in length - among other possible issues, this is actually more bases than exist in the input file, there are only 9510879 bases - and I can't seem to figure out the problem.

Maybe your eyes can see the problem more clearly?

odd_pairs.zip

althonos commented 2 years ago

Hello @KGerhardt ,

The possible difference with Prodigal might come from a typo in the original code that was causing score penalization to be applied to any gene on the reverse strand, see hyattpd/Prodigal#88. The fix has been integrated to Prodigal, but since the authors have not made a new release, unless you recompiled from source it's likely the differences you are seeing come from a comparison between the unfixed version of Prodigal and the current Pyrodigal version (which has the fix).

There are actually tests for consistency between the fixed (recompiled) version of Prodigal and Pyrodigal in the pyrodigal.tests module, so normally this shouldn't be an issue. I'll have a look at your test sequences with the fixed Prodigal version to see if the two are still inconsistent.

althonos commented 2 years ago

I've tried running your pyrodigal CLI against the original prodigal and noticed that there's some differences in the output. Most of the genes are identical or nearly identical, but differences do exist. I suspect that there's a difference in the training sequences

Okay, so after running first in meta mode, I can confirm the "stock" Prodigal doesn't produce the same results as Pyrodigal, but after recompiling it does. So i supposed this would be the same with the single mode too, I assume it's a difference in training.

In the case of the attached genome, prodigal claims it has a training sequence 9511467 bp in length - among other possible issues, this is actually more bases than exist in the input file, there are only 9510879 bases - and I can't seem to figure out the problem.

The way Prodigal handles multiple sequences in the case of training is that it will concatenate them altogether but separated by TTAATTAATTAA to prevent genes from running between contigs by forcing STOP codons in all 6 frames. Your test sequence has 49 contigs, and 9511467 - 9510879 = 588 = 12 * 49.

althonos commented 2 years ago

Okay, after implementing the same kind of padding as Prodigal does i get exactly the same results in single mode as well. I'll make a release so that the Pyrodigal CLI also does this. This is basically the code in case you want to do it in your script:

# use the same interleaving logic as Prodigal
sequences = []
for i, seq in enumerate(parse("input.fna")):
    if i > 0:
        sequences.append("TTAATTAATTAA")
    sequences.append(seq.seq)
if len(sequences) > 1:
    sequences.append("TTAATTAATTAA")

training_info = pyrodigal.train("".join(sequences))
KGerhardt commented 2 years ago

Ah I see. That little interleaving bit makes sense of the numbers of bases.

Thank you very much! I really appreciate your work.