althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0

Segmentation fault when annotating two Mycobacterial genomes with Pyrodigal #2

Closed: chg60 closed this issue 3 years ago

chg60 commented 3 years ago

Hello! First allow me to say that I'm so glad you built Pyrodigal, as it's both faster and far more convenient than when I was making subprocess calls to Prodigal!

I am working on a tool that aims to identify prophages (viral sequences integrated into the bacterial host chromosome), specifically within Mycobacteria, and so far this is mostly going well. A critical part of the workflow for this program is auto-annotation of the protein-coding genes, which are then examined for "phage-y" signal. I recently tried testing our program on a couple of larger Mycobacterial genomes:

M_kansasii_ATCC12478.txt M_marinum_E11.txt

Both of these genomes have a >6 Mbp chromosome and a >100 kbp plasmid. When I run the code in this file (gene_prediction.txt) on my Mac, it works just fine. But on my Linux workstation, something about the plasmids in both genomes triggers a segmentation fault (the main chromosome runs fine).

I'm hoping you can tell me whether I'm just using Pyrodigal incorrectly or if this does in fact appear to be a bug that can be fixed?

Prodigal itself does not crash if I run that directly, but Pyrodigal is far more convenient so I would love to keep using it.

For context, the specs of the two machines are:

Mac laptop: Mid-2014 15-inch MacBook Pro, Retina; 16 GB DDR3 memory, 1600 MHz; 4-core/8-thread Intel Core i7, 2.5 GHz; macOS Catalina 10.15.7

Linux workstation: 2019 home build; 128 GB DDR4 memory, 3200 MHz; 8-core/16-thread AMD Ryzen 7 3700X; Ubuntu 18.04 LTS

On both machines I have the most recent version of Pyrodigal installed in a conda environment (used pip to install, not the conda recipe), so the issue would appear to be OS or build-specific.

Please let me know if there are additional details that would be helpful!
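For a report like this one, the version and platform details listed above can also be gathered programmatically. The following is a minimal sketch using only the Python standard library; it is not part of the original report, and the helper names `describe_environment` and `enable_crash_traceback` are hypothetical. It also enables `faulthandler`, which asks the interpreter to print the Python traceback when a native extension crashes with a segmentation fault.

```python
import faulthandler
import platform
from importlib import metadata


def describe_environment(package: str = "pyrodigal") -> dict:
    """Collect the installed package version and basic platform details."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        version = None
    return {
        "package": package,
        "version": version,
        "python": platform.python_version(),
        "system": platform.system(),    # e.g. "Linux" or "Darwin"
        "machine": platform.machine(),  # e.g. "x86_64"
    }


def enable_crash_traceback() -> bool:
    """Try to dump the Python traceback on SIGSEGV and similar fatal signals."""
    try:
        faulthandler.enable()
    except (RuntimeError, AttributeError, OSError, ValueError):
        # stderr may be missing or replaced by an object without a file descriptor.
        return False
    return faulthandler.is_enabled()


crash_traceback_enabled = enable_crash_traceback()
report = describe_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running this on each machine before the gene prediction step makes it easier to compare the two environments side by side.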

althonos commented 3 years ago

Hi @chg60, thanks for the detailed report! On my Linux laptop the example files you provided also segfault, so I'll be able to investigate.

althonos commented 3 years ago

So, I tracked the issue down inside the Cython code, and I actually made a mistake while adapting the code from Prodigal: in one place where only a subset of the dynamic programming nodes was supposed to be iterated over to eliminate bad genes, I passed the total number of nodes instead, which could cause a NULL read.

This has been fixed, and I don't get a segmentation fault anymore when running on the two genomes you provided. I'll publish a new release shortly.
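The general class of mistake described above, giving a loop a larger count than the range it is meant to cover, can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual Pyrodigal or Prodigal code: the `Node` class, the `eliminate_bad_genes` signature, the threshold, and the idea of a buffer with unfilled slots are all hypothetical, and `None` stands in for a NULL pointer. In C or Cython the overrun would read invalid memory instead of raising an exception.

```python
from typing import List, Optional


class Node:
    """Hypothetical stand-in for a dynamic programming node."""

    def __init__(self, score: float) -> None:
        self.score = score
        self.eliminated = False


def eliminate_bad_genes(
    nodes: List[Optional[Node]], count: int, threshold: float = 0.0
) -> int:
    """Flag the first `count` nodes whose score is below `threshold`."""
    flagged = 0
    for i in range(count):
        node = nodes[i]
        if node is None:
            # In native code this would be the NULL read; here we fail loudly.
            raise ValueError(f"slot {i} is empty: count exceeds the valid range")
        if node.score < threshold:
            node.eliminated = True
            flagged += 1
    return flagged


# Assumption: a buffer with room for 6 nodes, of which only the first 4 are filled.
capacity = 6
nodes: List[Optional[Node]] = [None] * capacity
for i, score in enumerate([1.5, -0.2, 3.0, -4.1]):
    nodes[i] = Node(score)
valid_count = 4

# Correct call: iterate only over the nodes that are meant to be visited.
flagged = eliminate_bad_genes(nodes, valid_count)

# Buggy call: passing the total size walks into the empty slots.
try:
    eliminate_bad_genes(nodes, capacity)
    overrun_detected = False
except ValueError:
    overrun_detected = True
```

Such an overrun can go unnoticed on one platform and crash on another, because what lies past the valid range depends on how memory happens to be laid out.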

althonos commented 3 years ago

Release v0.4.7 is out; it is now on PyPI and will be in Bioconda soon. Thanks for finding this package useful and using it for your research projects. Maybe see you at a bioinformatics conference some day!

chg60 commented 3 years ago

I just installed v0.4.7 and confirm that it no longer gives me the segmentation fault. Thank you for taking care of that so quickly, and thank you again for this awesome tool. Best, Christian