Thanks @pchaumeil, I'll have a look!
Is it normal to have that many differences across all these genomes? Is Pyrodigal more accurate in this case?
I can't say for sure before finding out what's causing the discrepancy; the bug may come either from Prodigal or from Pyrodigal. From the examples you provided, it looks like the issue is coming from RBS detection at the contig edge on the reverse strand, which I thought I had fixed as part of #27.
I think I found a bug, and it's actually coming from Prodigal:
In your first gene, there is a start codon located about 10 bp from the contig edge on the reverse strand. Prodigal / Pyrodigal will look for the RBS motif, and will scan the region before the ATG codon: GGATAGGCCCCATG. The very beginning of the contig is GGA, which is a potential RBS.
Now, here's the difference: Prodigal appears to read an extra A after the contig end, and to mark the RBS motif as being AGGA. Because this is a high-scoring motif (7.85), the start codon is retained as the start of the gene.
As far as I can tell the bug only occurs on the reverse strand, which may explain why your two examples (and perhaps the rest of your discrepancies) all occur on a partial gene on the reverse strand.
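To make the edge effect above concrete, here is a toy sketch. It is not the actual RBS scoring code of Prodigal or Pyrodigal; it only uses plain substring search on the upstream region from the example, and the extra `A` read past the contig end is an assumption taken from the description in this thread.

```python
# Toy illustration only: NOT the real Prodigal / Pyrodigal RBS scanner.
# Upstream region before the ATG codon, taken from the example in this thread.
UPSTREAM = "GGATAGGCCCC"


def find_motif(region: str, motif: str) -> int:
    """Return the index of the first occurrence of `motif` in `region`, or -1."""
    return region.find(motif)


# Bounded scan: only bases that really exist in the contig are considered.
bounded_gga = find_motif(UPSTREAM, "GGA")
bounded_agga = find_motif(UPSTREAM, "AGGA")

# Hypothetical overrun: pretend one extra base ("A") is read past the contig end.
overrun_region = "A" + UPSTREAM
overrun_agga = find_motif(overrun_region, "AGGA")

print("GGA within bounds:", bounded_gga)
print("AGGA within bounds:", bounded_agga)
print("AGGA with one phantom base:", overrun_agga)
```

With a bounded scan only the shorter GGA motif can match at the edge; the longer AGGA motif only appears once a base from outside the contig is included.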
Indeed, if you reverse-complement the contig and try again, both Pyrodigal and Prodigal agree on the following gene:
>CAKWEX010000332.1_1 # 3 # 830 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.601
IGPMSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVE
TGFRAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPL
WFYSPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPG
KTRLDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPN
PLRFPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*
which has an Edge start, so this means Prodigal accurately recognized the RBS motif as GGA this time, and discarded the start codon as expected. I'll make a bug report on the original repository :+1:
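For anyone who wants to repeat the reverse-complement check on their own contigs, a minimal sketch of the reverse-complement step in plain Python could look like the following. The helper name `reverse_complement` is only for illustration, and characters other than A/C/G/T (such as N) are left unchanged, which is an assumption that may not suit every dataset.

```python
# Translation table for the four standard bases, upper and lower case.
# Any other character (e.g. N) is passed through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Short region from this thread, used here only as a small example input.
example = "GGATAGGCCCCATG"
print(reverse_complement(example))
```

The reverse-complemented contig can then be given to both tools to see whether they still disagree.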
I'm gonna close this, given I think this was linked to https://github.com/hyattpd/Prodigal/pull/100. Feel free to open another issue if you keep seeing this problem with the new version :+1:
Hello,
I have run another big test to compare Prodigal and Pyrodigal across ~400K genomes.
From this test, ~17K genomes have a difference in their gene calling between the two tools. I have attached the list of all genomes with a difference here: index_genomes.txt
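A comparison like this can be sketched roughly as follows, assuming both tools write protein FASTA files with Prodigal-style headers (`>ID # start # end # strand # attributes`). The function names and the second header in the example are hypothetical, and this is not necessarily how the test above was run.

```python
def parse_header(header: str) -> tuple:
    """Parse a Prodigal-style FASTA header into (start, end, strand)."""
    fields = [field.strip() for field in header.lstrip(">").split(" # ")]
    return int(fields[1]), int(fields[2]), int(fields[3])


def gene_calls(fasta_text: str) -> set:
    """Collect (start, end, strand) for every header line in a FASTA string."""
    return {
        parse_header(line)
        for line in fasta_text.splitlines()
        if line.startswith(">")
    }


def differing_calls(fasta_a: str, fasta_b: str) -> set:
    """Return gene coordinates present in only one of the two outputs."""
    return gene_calls(fasta_a) ^ gene_calls(fasta_b)


# First header uses coordinates in the style reported in this thread;
# the second header is a made-up variant, used only to show a mismatch.
output_a = ">CAKWEX010000332.1_1 # 3 # 830 # 1 # ID=1_1;partial=10\nIGPMSNHF*\n"
output_b = ">CAKWEX010000332.1_1 # 3 # 800 # 1 # ID=1_1;partial=00\nIGPMSNHF*\n"
print(sorted(differing_calls(output_a, output_b)))
```

A genome would count as "different" whenever `differing_calls` returns a non-empty set for its two outputs.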
Here are a few examples of these differences:
for GCA_934838455.1:
for GCA_934561095.1
Is it normal to have that many differences across all these genomes? Is Pyrodigal more accurate in this case?
Thank you