althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
132 stars 5 forks source link

BrokenProcessPool #46

Closed oschwengers closed 8 months ago

oschwengers commented 9 months ago

Hi @althonos, thanks for the recent 3.0 version and all its improvements. I'm currently working on a patch to bring this into Bakta.

Primary, I wanted to implement a multi-threaded version of this as suggested in https://github.com/althonos/pyrodigal#-thread-safety. However, in meta mode, I always run into a concurrent.futures.process.BrokenProcessPool. I cannot 100% rule out that maybe there is something wrong on our site, but after playing around and running a minimal example, I wanted to let you know - just in case there might be bug lurking in Pyrodigal.

So, using a small plasmid as input, the following minimal example is working as expected:

import sys
import concurrent.futures as cf
import pyrodigal
from Bio import SeqIO

sequences = []
with open(str(sys.argv[1])) as fh:
    for record in SeqIO.parse(fh, 'fasta'):
        sequences.append(str(record.seq).upper())

gene_finder = pyrodigal.GeneFinder(meta=True, metagenomic_bins=None, closed=False, mask=True)
for genes in map(gene_finder.find_genes, sequences):
    for gene in genes:
        print(gene)

However, If i switch to a parallel setup, this is not working anymore:

import sys
import concurrent.futures as cf
import pyrodigal
from Bio import SeqIO

sequences = []
with open(str(sys.argv[1])) as fh:
    for record in SeqIO.parse(fh, 'fasta'):
        sequences.append(str(record.seq).upper())

gene_finder = pyrodigal.GeneFinder(meta=True, metagenomic_bins=None, closed=False, mask=True)
with cf.ProcessPoolExecutor(max_workers=4) as ppe:
    for genes in ppe.map(gene_finder.find_genes, sequences):
        for gene in genes:
            print(gene)

Giving this stacktrace:

Traceback (most recent call last):
  File "/home/oliver/software/bakta/bakta/bakta/pyrodigal-test.py", line 13, in <module>
    for genes in ppe.map(gene_finder.find_genes, sequences):
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/oliver/tmp/conda-envs/bakta-test/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

So this only occurs in meta mode; the parallel implementation works fine in parallel on larger sequences running in default non-meta mode. Any ideas why this is not working? Thanks a lot in advance!

Best regards Oliver

althonos commented 8 months ago

Hi @oschwengers,

I haven't used future-based concurrency that much, but I notice that you're using ProcessPoolExecutor; using a ThreadPoolExecutor should work fine because most of the Pyrodigal code is nogil so it should run with true parallelism even within a single process. Tell me if you have errors there -- otherwise, I guess the problem may come from inter-process communication and I'll have a look.

Cheers :smile:

oschwengers commented 8 months ago

Thanks @althonos, indeed using a ThreadPoolExecutor seems to work. This is very interesting, since I always thought that CPU bound tasks could not effectively parallelized by Python threads. I guess some Cython magic?

Anyways, thanks a lot for this! I'll test this a bit and maybe re-active the parallel gene prediction in Bakta which indeed saves a few tens of seconds in meta mode on a draft genome.

althonos commented 8 months ago

Indeed, Cython lets you declare code that runs in no-GIL mode, but for that you need to have code that doesn't interact with the Python interpreter in any way during these sections -- this is the case in Pyrodigal because the whole computation uses C data structures (from the Prodigal code) and only wraps the results at the very end of the computation 🙂

I'll try to have a look at the process pool eventually but I don't have ideas as to what could be wrong right now !

oschwengers commented 8 months ago

Ah, I see. Thanks for the explanation. So far, everything seem to work using the ThreadPool which - of course - is faster on its own by avoiding all the back-and-forth copying of data, external process overhead, etc. Thanks again. I'll close this for now.

althonos commented 8 months ago

By the way, it looks like the error with ProcessPool was indeed caused by a pickling bug; I'll make a patch just in case, but it's quite likely that a ThreadPool is still faster.

oschwengers commented 8 months ago

Thanks a lot for the confirmation and quick fix. Indeed, the ThreadPool is much faster.