Closed oschwengers closed 8 months ago
Hi @oschwengers,
I haven't used future-based concurrency that much, but I notice that you're using ProcessPoolExecutor; using a ThreadPoolExecutor should work fine because most of the Pyrodigal code is nogil, so it should run with true parallelism even within a single process. Tell me if you have errors there -- otherwise, I guess the problem may come from inter-process communication and I'll have a look.
Cheers :smile:
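For readers following along, the suggested swap might look roughly like the sketch below. The `predict_genes` function and the `contigs` list are hypothetical stand-ins, not part of Pyrodigal; a pure-Python function like this one does not release the GIL, so the sketch only shows the executor pattern, not the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_genes(sequence: str) -> int:
    """Hypothetical stand-in for a gene-finder call: counts ATG start codons."""
    return sequence.count("ATG")


# Hypothetical input; in practice these would be the contigs of a genome.
contigs = ["ATGAAATAG", "CCATGGGATGTAA", "TTTT"]

# ThreadPoolExecutor exposes the same map/submit interface as
# ProcessPoolExecutor, but the workers are threads inside one process,
# so arguments and results are not pickled or copied between processes.
with ThreadPoolExecutor(max_workers=2) as executor:
    # map() yields results in the same order as the inputs.
    counts = list(executor.map(predict_genes, contigs))

print(counts)
```

Because both executors share the same interface, switching between them is usually a one-line change.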
Thanks @althonos,
indeed, using a ThreadPoolExecutor seems to work. This is very interesting, since I always thought that CPU-bound tasks could not be effectively parallelized by Python threads. I guess some Cython magic?
Anyway, thanks a lot for this! I'll test this a bit and maybe re-activate the parallel gene prediction in Bakta, which indeed saves a few tens of seconds in meta mode on a draft genome.
Indeed, Cython lets you declare code that runs in no-GIL mode, but for that you need to have code that doesn't interact with the Python interpreter in any way during these sections -- this is the case in Pyrodigal because the whole computation uses C data structures (from the Prodigal code) and only wraps the results at the very end of the computation 🙂
I'll try to have a look at the process pool eventually, but I don't have any ideas as to what could be wrong right now!
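The same effect can be seen with any C-backed code that releases the GIL. As an analogy only (this is not Pyrodigal's code), the sketch below uses `hashlib` from the standard library, whose C implementation releases the GIL while hashing large buffers, so several threads can hash at once. The block sizes and the `digest` helper are arbitrary choices for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(block: bytes) -> str:
    # The hashing happens in C; for large buffers hashlib releases the GIL,
    # so other threads can run Python or C code in the meantime.
    return hashlib.sha256(block).hexdigest()


# Four 1 MB buffers with different contents (arbitrary test data).
blocks = [bytes([i]) * 1_000_000 for i in range(4)]

# Reference results computed one after another.
sequential = [digest(b) for b in blocks]

# Same work distributed over threads; the results must be identical.
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(digest, blocks))

print(threaded == sequential)
```

Whether threads actually give a speed-up depends on how much of the work happens inside such GIL-free sections.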
Ah, I see. Thanks for the explanation. So far, everything seems to work using the ThreadPool, which, of course, is faster on its own by avoiding all the back-and-forth copying of data, external process overhead, etc. Thanks again. I'll close this for now.
By the way, it looks like the error with ProcessPool was indeed caused by a pickling bug; I'll make a patch just in case, but it's quite likely that a ThreadPool is still faster.
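As general background, a process pool has to pickle arguments and results to move them between processes, while a thread pool does not. The sketch below is a generic illustration of an object that cannot be pickled; it does not reproduce the actual Pyrodigal bug, and the `Worker` class and `can_pickle` helper are hypothetical.

```python
import pickle
import threading


class Worker:
    """Hypothetical object holding a resource that cannot be pickled."""

    def __init__(self):
        # Lock objects cannot be serialized by pickle.
        self.lock = threading.Lock()


def can_pickle(obj) -> bool:
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


plain_ok = can_pickle({"contig": "ATGAAATAG"})  # plain data pickles fine
worker_ok = can_pickle(Worker())                # fails because of the lock

print(plain_ok, worker_ok)
```

A check like this can help narrow down whether an object is safe to send to a process pool.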
Thanks a lot for the confirmation and quick fix. Indeed, the ThreadPool is much faster.
Hi @althonos, thanks for the recent 3.0 version and all its improvements. I'm currently working on a patch to bring this into Bakta. Primarily, I wanted to implement a multi-threaded version of this as suggested in https://github.com/althonos/pyrodigal#-thread-safety. However, in meta mode, I always run into a concurrent.futures.process.BrokenProcessPool. I cannot 100% rule out that there is something wrong on our side, but after playing around and running a minimal example, I wanted to let you know, just in case there might be a bug lurking in Pyrodigal.
So, using a small plasmid as input, the following minimal example is working as expected:
However, if I switch to a parallel setup, this is not working anymore:
Giving this stacktrace:
So this only occurs in meta mode; the parallel implementation works fine on larger sequences in the default non-meta mode. Any ideas why this is not working? Thanks a lot in advance!
Best regards
Oliver