althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
138 stars 5 forks source link

Difference between Pyrodigal 2.0.4 and Prodigal 2.6.3 #28

Closed pchaumeil closed 1 year ago

pchaumeil commented 1 year ago

Hello,

I have run another big test to compare prodigal and pyrodigal across ~400K genomes.

from this test ~17K genomes have a difference in their gene calling between the 2 softwares. I have attached the list of all genomes with a difference here : index_genomes.txt

here are few examples of these differences:

for GCA_934838455.1:

diff prodigal_2.6.3_amino.faa pyrodigal_2.0.4_amino.faa
>CAKWEX010000332.1_3 # 2936 # 3754 # -1 # ID=332_3;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.602
MSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVETGF
RAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPLWFY
SPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPGKTR
LDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPNPLR
FPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*
---
>CAKWEX010000332.1_3 # 2936 # 3763 # -1 # ID=332_3;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.603
IGPMSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVE
TGFRAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPL
WFYSPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPG
KTRLDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPN
PLRFPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*

for GCA_934561095.1

diff prodigal_2.6.3_amino.faa pyrodigal_2.0.4_amino.faa
>CAKTGG010000246.1_2 # 738 # 2552 # -1 # ID=246_2;partial=00;start_type=TTG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.618
MKLSARKDVPVNETWDLSLIFAAEADFEAAVEKTKALADTLEKTYKNALTTPESIAECLA
LYEELEILLYQTTSYTSLAVSVDYTDTEAQKKDAKMSALAAEIGSRLSFIESEIADAPEE
LIRAAMDKTERAKHYLAEILREKPHRLSAETEKVLAALRPVFNAPYDIYHMTKLADMKFD
SFTVNSKEYPLGYSLFEDEYEYEADTDVRRAAFRAFSDKLRQYENTTAATYNTYLTQQRI
MAHQRGFADMFEADLFADHVTREMYDRQIDLITEKLAPAMRKYARLVGKMNKLDRVTFAD
LKLPLDAEFDPRVTIEESREYVRSALSVLGQDYADMVDEAYDKRWIDFARNAGKETGGFC
SSPYGCNSFILLSWNNRMADVFTIAHELGHAGHFRLCNGAQSLFDTNVSGYLIEAPSTMN
ELLLAQDLLRKNTDKRFRRWVLSSLIGHTYYHNFVTHLREAWYQREAMNIIEQGGAVNAE
TLSGIFRRNLETFWGDAVELTEGCELTWMRQPHYYMGLYSYTYSAGLTLATQAALNIAAE
GESAVARWRAMLEAGSTRGPLGLAEIAGIDLSTPDALEHTIAYISDIIDEIAVLTEELDG
ITLD*
---
>CAKTGG010000246.1_2 # 738 # 2564 # -1 # ID=246_2;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.616
EDIHLKLSARKDVPVNETWDLSLIFAAEADFEAAVEKTKALADTLEKTYKNALTTPESIA
ECLALYEELEILLYQTTSYTSLAVSVDYTDTEAQKKDAKMSALAAEIGSRLSFIESEIAD
APEELIRAAMDKTERAKHYLAEILREKPHRLSAETEKVLAALRPVFNAPYDIYHMTKLAD
MKFDSFTVNSKEYPLGYSLFEDEYEYEADTDVRRAAFRAFSDKLRQYENTTAATYNTYLT
QQRIMAHQRGFADMFEADLFADHVTREMYDRQIDLITEKLAPAMRKYARLVGKMNKLDRV
TFADLKLPLDAEFDPRVTIEESREYVRSALSVLGQDYADMVDEAYDKRWIDFARNAGKET
GGFCSSPYGCNSFILLSWNNRMADVFTIAHELGHAGHFRLCNGAQSLFDTNVSGYLIEAP
STMNELLLAQDLLRKNTDKRFRRWVLSSLIGHTYYHNFVTHLREAWYQREAMNIIEQGGA
VNAETLSGIFRRNLETFWGDAVELTEGCELTWMRQPHYYMGLYSYTYSAGLTLATQAALN
IAAEGESAVARWRAMLEAGSTRGPLGLAEIAGIDLSTPDALEHTIAYISDIIDEIAVLTE
ELDGITLD*

Is this the normal to have that many difference across all these genomes? Is Pyrodigal more accurate in this case?

Thank you

althonos commented 1 year ago

Thanks @pchaumeil, I'll have a look!

Is this the normal to have that many difference across all these genomes? Is Pyrodigal more accurate in this case?

I can't say for sure before finding out what's causing the discrepancy, the bug may come either from Prodigal or from Pyrodigal. From the examples you provided it looks like the issue is coming from RBS detection on the reverse strand edge, which I though I fixed as a cause of #27.

althonos commented 1 year ago

I think I found a bug, and it's actually coming from Prodigal:

In your first gene, there is a start codon located around 10bp of the contig edge on the reverse strand. Prodigal / Pyrodigal will look for the RBS motif, and will scan the region before the ATG codon: GGATAGGCCCCATG. The very beginning of the contig is GGA, which is a potential RBS.

Now, here's the difference:

As far as I can tell the bug only occurs on the reverse strand, which may explain why your two examples (and perhaps the rest of your discrepancies) all occur on a partial gene on the reverse strand.

althonos commented 1 year ago

Indeed, if you reverse-complement the contig and try again, both Pyrodigal and Prodigal agree on the following gene:

>CAKWEX010000332.1_1 # 3 # 830 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.601
IGPMSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVE
TGFRAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPL
WFYSPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPG
KTRLDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPN
PLRFPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*

which has an Edge start, so this means Prodigal accurately recognized the RBS motif as GGA this time, and discarded the start codon as expected. I'll make a bug report on the original repository :+1:

althonos commented 1 year ago

I'm gonna close this, given I think this was linked to https://github.com/hyattpd/Prodigal/pull/100. Feel free to open another issue if you keep saying this problem with the new version :+1: