althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
132 stars 5 forks source link

Inconsistent gene predictions for Pyrodigal and *fixed* Prodigal #21

Closed oschwengers closed 1 year ago

oschwengers commented 1 year ago

Hi and thanks so much for working on this. A patched, accelerated and potentially multi-threaded Prodigal alternative is very much needed and your effort very much appreciated!

We considered Pyrodigal as a promising alternative for Prodigal within Bakta and Platon. To add an extra layer of confidence we started a gene prediction comparison on 49 bacterial genomes. The large majority of all genes are equally predicted but unfortunately, some prediction slightly differ and some genes are not predicted by Pyrodigal at all.

We waited to double check everything but as I saw your issue https://github.com/althonos/pyrodigal/issues/19 today, I thought this could maybe help to further debug it. @jhahnfeld has conducted the comparison and could provide more info on this.

Here are the results using a locally compiled and patched Prodigal version:

genome strand prodigal_start prodigal_stop prodigal_edge pyrodigal_start pyrodigal_stop pyrodigal_edge
GCF_000008865 -1 2727810 2728103 False
GCF_000009045 -1 955653 955817 False
GCF_000020025 1 1689342 1689500 False
GCF_000022565 -1 3550577 3553432 False
GCF_000067165 1 2867204 2867602 False 2867036 2867602 False
GCF_000195955 1 3960860 3961786 False 3961052 3961786 False
GCF_000195955 1 2052933 2053289 False 2052969 2053289 False
GCF_000218265 -1 1551162 1551524 False
GCF_000222975 -1 1950139 1950465 False
GCF_000987925 -1 331455 331595 False
GCF_001457455 1 2243720 2243851 False
althonos commented 1 year ago

Hi Oliver, thanks a lot for reaching out, I'm happy to hear you're considering Pyrodigal for super useful tools like these! I've been advocating to Bakta for some time :wink:

I'm very thankful for the test set and the gene prediction comparison analysis, to be fair this is something I should have done myself earlier. I'm not yet entirely sure whether it's related to #19 or not because #19 actually affects the already computed genes by recording the wrong scores for display, but it shouldn't change the prediction of the genes themselves. Nevertheless I'll have a look at these genomes, it could be that in the v1 refactor I did some mistakes while adapting the original code.

Just so that I can reproduce, is this in meta or single mode? I also recently found out that Prodigal was doing some out-of-bound reads of sequence data while training, so I'll see about patching that, but I don't think that this is causing the discrepancies you've been seeing.

althonos commented 1 year ago

Just found the comparison repository, thanks @jhahnfeld for setting this up!

jhahnfeld commented 1 year ago

Hi Martin,

I just wanted to provide you with the link to the comparison repository, but you already found it :) It now has a results folder with the current differences.

Just so that I can reproduce, is this in meta or single mode?

We used the single mode.

althonos commented 1 year ago

I just tested the comparison by using Pyrodigal where instead of training I loaded the training file from Prodigal: the results matched 100%. So it looks like Pyrodigal has some discrepancies in the training process, I'll have a look ASAP!

jhahnfeld commented 1 year ago

I updated the prediction mismatches and removed the '-c' flag since the tested genomes are closed. Now there are some more mismatches and completely missing predictions.

zdk123 commented 1 year ago

I just want to +1 this effort. Finding a few [minor] differences in our tests as well in meta mode.

althonos commented 1 year ago

@zdk123 : If you have some example sequences with inconsistencies in meta mode please share them here! I have identified a bug in the training process that I am working on fixing, but meta mode shouldn't be concerned, so if you see discrepancies there too that another bug that i have to fix :sweat_smile:

althonos commented 1 year ago

Just shipped pre-release 2.0.0-rc.2, which seems to have fixed the issues on closed=True, where I'm not seeing any mismatches anymore. Problem was coming from a weird behaviour of Prodigal when in training mode that I didn't replicate exactly in my port of the score_connection function. (I'm actually thinking that's a bug in the original codebase, where a maximum search algorithm is not actually recording the maximum when in training mode).

Still seeing some mismatches in closed=False however, so I'll keep chasing bugs :smiling_face_with_tear:

zdk123 commented 1 year ago

@althonos Actually I think this may have been the same as issue #19 v2-rc2 seems to have fixed

althonos commented 1 year ago

Okay, found and patched a new bug (#22) which fixes the remaining issues on closed=False. On my machine the new v2.0.0-rc.3 release passes all comparison tests from @jhahnfeld's repository with both closed=True and closed=False. Please give it a try, if all is well I'll make a proper v2.0.0 after adding a bit of documentation and some minor fixes!

oschwengers commented 1 year ago

Thanks a lot for this. We'll have a look at this ASAP. Just out of curiosity: I'm wondering if this update implies larger code changes that would qualify for a major version bump to 2.0.0? IMHO a minor bump (1.2.0) would be sufficient, too if there are no breaking API changes. This might also help to keep in sync with a potentially upcoming Prodigal 2.0 version.

althonos commented 1 year ago

@oschwengers : There are breaking changes coming from #18 because I'm going to add a mandatory argument to all Genes.write_* methods. In addition I'll also probably change some attributes of TrainingInfo that return copies to return views of the underlying buffers. That's why I'm trying to synchronize the release a bit this time, so that I can put all the API breaking changes in there at once.

In general choosing the versioning scheme is always complicated for bindings, but I think in here I'm just going to let Pyrodigal have its own versioning scheme following semver :) And actually Prodigal is at version 2.6.3, so this leaves the door open for Prodigal 3.0.

oschwengers commented 1 year ago

Ahh, I see. OK, that makes totally sense. Thanks!

jhahnfeld commented 1 year ago

Hi, thank you for looking into the issues. I have tested Pyrodigal v2.0.0-rc.3 and still have two mismatches in GCF_000008865 with closed=False. For closed=True, I am getting new mismatches now that were not reported before (I have also added the report of mismatched stop codons.). As for the potential bug in the score_connection function, it might be good to report it upstream to check if the behavior is intended or not.

althonos commented 1 year ago

@jhahnfeld : I'm not seeing any mismatch in closed=True anymore on my end. Are you deleting the tmp folder between runs in closed=False and closed=True modes? If you don't, Prodigal will not perform the training if the training file already exists, so you will end up using the wrong training file. See PR : https://github.com/jhahnfeld/prodigal-pyrodigal-comparison/pull/1

jhahnfeld commented 1 year ago

Sorry, I forgot to delete the tmp folder between the runs. :sweat: Now there are no more mismatches in the comparison test. P.S. Thanks for the PR

oschwengers commented 1 year ago

Now, that we all see the same gene predictions between Prodigal and Pyrodigal, I'll close this issue. Just in case I've missed something, please do not hesitate to re-open it.

Great effort and results! Thanks a lot everyone! Can't wait to use Pyrodigal in our projects...

althonos commented 1 year ago

Thanks a lot for the extensive bug report and the comparison repository!