althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

Work with multiprocessing #21

Closed jpjarnoux closed 2 years ago

jpjarnoux commented 2 years ago

Hi, I would like to use multiple CPUs, but I don't understand how to give more than one CPU to pyhmmer. So I tried the multiprocessing package, but pyhmmer objects have a non-trivial __cinit__ and cannot be pickled. Example: multiprocessing.pool.MaybeEncodingError: Error sending result: '<pyhmmer.plan7.TopHits object at 0x561959114ad0>'. Reason: 'TypeError('no default __reduce__ due to non-trivial __cinit__')'

Could you give me an example of how to use pyhmmer with more than one CPU, if that's possible? Thanks

althonos commented 2 years ago

Hi @jpjarnoux

pyhmmer releases the GIL where applicable, so you don't have to use processes to get parallelism: threads will work efficiently as well. Try using multiprocessing.pool.ThreadPool instead of multiprocessing.pool.Pool; this should already give you decent performance (or use pyhmmer.hmmsearch, which does it for you). Otherwise, I'll try adding pickle support to TopHits when I have some time.
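The suggestion above can be sketched as follows. The worker here is only a self-contained stand-in (it hashes bytes so the snippet runs anywhere); in real code the body of `run_search` would be a pyhmmer pipeline call, which is what actually releases the GIL and lets the threads overlap.

```python
from multiprocessing.pool import ThreadPool
import hashlib

def run_search(task):
    # Placeholder workload. In real code this would be something like
    # a pyhmmer search over one query, whose C internals release the
    # GIL so the ThreadPool workers run concurrently.
    data, rounds = task
    digest = data
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

tasks = [(b"seq%d" % i, 10_000) for i in range(8)]

# ThreadPool has the same map() API as Pool, but results never need
# to be pickled, so non-picklable objects like TopHits are fine.
with ThreadPool(processes=4) as pool:
    results = pool.map(run_search, tasks)

print(len(results))  # one result per task
```

Because the workers are threads, the `TopHits` objects a real pipeline would return stay in the same address space and never hit the pickling error from the original report.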

jpjarnoux commented 2 years ago

Okay, thanks, that's what I was reading. However, if I have 16 CPUs available, it looks like they are not fully used. Maybe it's possible to tell that to the GIL? I will try your advice tomorrow and keep you in touch. Thanks

althonos commented 2 years ago

Then it really depends on what you are trying to achieve; I cannot really guess without seeing your use case. Perhaps you don't have enough target sequences to make full use of all your CPUs.

In my benchmarks, I also noticed that HMMER has a hard time scaling past the number of physical CPUs, because it uses too many SIMD registers to benefit from hyperthreading. It could be that you're on a machine with 8 physical / 16 logical cores; in that case, you'll see no improvement using 16 jobs instead of 8.
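A practical consequence of the observation above is to size the worker pool to the physical core count rather than the logical one. A minimal sketch, assuming psutil may or may not be installed (the `// 2` fallback is just a heuristic guess of two logical cores per physical one, not a guarantee):

```python
import os

def physical_core_estimate():
    """Best-effort count of physical cores for sizing a worker pool."""
    try:
        import psutil  # third-party, optional
        n = psutil.cpu_count(logical=False)
        if n:
            return n
    except ImportError:
        pass
    # Fallback heuristic: os.cpu_count() reports *logical* cores;
    # assume hyperthreading doubles them.
    logical = os.cpu_count() or 1
    return max(1, logical // 2)

workers = physical_core_estimate()
print(workers)
```

On the hypothetical 8-physical / 16-logical machine described above, this would suggest 8 workers instead of 16.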

jpjarnoux commented 2 years ago

Sorry, I should explain more clearly what I'm doing. I'm trying to annotate proteins with 4,000 HMMs, with one file per HMM. Previously I built one database with all my HMMs. Now, to be more efficient, I'm trying to split the work across multiple databases and concatenate the results. I'll keep you in touch, thank you.

jpjarnoux commented 2 years ago

Hi, I finally used concurrent.futures.ThreadPoolExecutor and everything works very efficiently. Thanks for your help.
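For readers landing on this thread, the reported fix follows this pattern. `search_database` is a hypothetical stand-in (it returns placeholder tuples so the snippet is runnable); in the actual use case each call would open one HMM database and run a pyhmmer search, and the per-database results would be concatenated afterwards as described above.

```python
from concurrent.futures import ThreadPoolExecutor

def search_database(db_name):
    # Hypothetical placeholder: the real worker would load the HMM
    # file `db_name` and run a pyhmmer search over the proteins,
    # returning its hits. Threads suffice because pyhmmer releases
    # the GIL during the search.
    return (db_name, len(db_name))

databases = ["db_%02d.hmm" % i for i in range(16)]

# One thread per physical core (8 assumed here, per the discussion
# above about hyperthreading).
with ThreadPoolExecutor(max_workers=8) as executor:
    all_hits = list(executor.map(search_database, databases))

print(len(all_hits))  # one result list per database, ready to concatenate
```

`executor.map` preserves input order, so concatenating the per-database results keeps them aligned with the database list.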

althonos commented 2 years ago

Happy to hear this!