althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
138 stars 5 forks source link

Non-deterministic behaviour #29

Closed aaronmussig closed 2 weeks ago

aaronmussig commented 1 year ago

Hello,

Firstly, thanks for your work with Pyrodigal! I wasn't able to determine if Pyrodigal should be deterministic, if it is then ignore this ticket.

I have an extremely rare case that took quite some time to identify, but Pyrodigal will occasionally give a different result when running via Shell vs. Python subprocess. The strange part is that the likelihood of Pyrodigal giving a different result is higher when running via a Python subprocess though.

To replicate the issue:

Dockerfile

FROM python:3.10-slim

RUN apt-get update && apt-get install -y \
    curl \
    unzip \
    && rm -rf /var/lib/apt/lists/*

RUN python -m pip install pyrodigal==2.0.4

RUN mkdir -p /data /results /tmp/download

WORKDIR /tmp/download

RUN curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_009700405.1/download?include_annotation_type=GENOME_FASTA&filename=GCA_009700405.1.zip" -H "Accept: application/zip" && \
    unzip GCA_009700405.1.zip && \
    rm GCA_009700405.1.zip && \
    mv ncbi_dataset/data/GCA_009700405.1/GCA_009700405.1_ASM970040v1_genomic.fna /data/genome.fna && \
    rm -rf /tmp/download

WORKDIR /data

Entering container

docker build -t pyrodigal_test . && docker run -it pyrodigal_test /bin/bash

Running Pyrodigal

#!/bin/bash

for i in {1..100}
do
   python -c "import os; os.system('pyrodigal -m -i /data/genome.fna -g 11 -o /dev/null -a /results/python_$i.faa -d /dev/null -p single')";
   pyrodigal -m -i /data/genome.fna -g 11 -o /dev/null -a /results/shell_$i.faa -d /dev/null -p single;
done

Results

Over 200 trials I get the following results:

Hash Command Line (count) Python os.system (count)
a9f114 192 178
597610 8 22

The differences between the two hashes are:

7082,7090c7082,7090
< >WLMD01000046.1_10 # 9095 # 10405 # -1 # ID=101_10;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.487
< MSADDQLRKQQEFVLRTIEERNIRFVRLWFTDVLGFLKSVAIAPAELANAFDEGIGFDGS
< AIEGFARITESDMLAKPDPSTFSVLPWRTEAPGAARMFCDIVMPDGSASHADPRHVLRRI
< LNKAATMGYTCYTHPEIEFFLFKDRPEIGKRPTPVDQGGYFDHTPAVVGHDFRRTAITML
< EAMGISVEFSHHEGAPGQQEIDLRYADALTTADNIMTFRHVVKEVALDQGFHASFIPKPF
< TDHPGSGMHTHVSLFQGEKNAFYDAKAEYNLSKVGRSFIAGLLRHAPEITAVTNQWVNSY
< KRLHGGGEAPALVNWGHNNRGALVRVPMYKPNNENSTRVEFRSPDSACNPYLAYAVMIAA
< GLKGVEEGYELADSSDATVLPSNLNEAIIAMEKSALVRETLGEHVFEYVLRNKRAEWNDY
< SRQVTAYELDRYLPIL*
---
> >WLMD01000046.1_10 # 9095 # 10414 # -1 # ID=101_10;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.487
> YVPMSADDQLRKQQEFVLRTIEERNIRFVRLWFTDVLGFLKSVAIAPAELANAFDEGIGF
> DGSAIEGFARITESDMLAKPDPSTFSVLPWRTEAPGAARMFCDIVMPDGSASHADPRHVL
> RRILNKAATMGYTCYTHPEIEFFLFKDRPEIGKRPTPVDQGGYFDHTPAVVGHDFRRTAI
> TMLEAMGISVEFSHHEGAPGQQEIDLRYADALTTADNIMTFRHVVKEVALDQGFHASFIP
> KPFTDHPGSGMHTHVSLFQGEKNAFYDAKAEYNLSKVGRSFIAGLLRHAPEITAVTNQWV
> NSYKRLHGGGEAPALVNWGHNNRGALVRVPMYKPNNENSTRVEFRSPDSACNPYLAYAVM
> IAAGLKGVEEGYELADSSDATVLPSNLNEAIIAMEKSALVRETLGEHVFEYVLRNKRAEW
> NDYSRQVTAYELDRYLPIL*

i.e. the 597610 hash starts with Y instead of M.

althonos commented 1 year ago

Hi @aaronmussig,

Normally Pyrodigal should indeed be deterministic. Non-determinism is often caused when some parts of the program are reading undefined memory, often out of bounds.

This may be linked to https://github.com/hyattpd/Prodigal/pull/100 as I'm seeing the error is happening next to the edge on the reverse strand. I'll check if the fixed version still causes the issue.

aaronmussig commented 2 weeks ago

Hi @althonos,

Re-ran this using v3.5.2 and both execution methods are returning the same result. Thanks for looking into it!

althonos commented 2 weeks ago

Good to hear, thanks!