althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

CPUs not being utilized #41

Closed EvanKomp closed 1 year ago

EvanKomp commented 1 year ago

Hello!

I have a use case where I am running hmmscan for a number of proteins against Pfam.

I have prefetched Pfam by pressing it and then loading an OptimizedProfileBlock, to be my targets.

I then loaded my sequences into DigitalSequenceBlock.

I am not very memory-limited, so I figured prefetching the HMMs and loading the sequences as a block (as opposed to iterating a SequenceFile) was the right strategy; let me know if not.

Regardless, when I specify cpus=32, I see only one CPU being utilized (100% CPU in top).
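
For reference, here is roughly what I am doing, trimmed down to a sketch (paths are placeholders, and the loading calls are approximately what my script does, so the exact calls may differ a bit from what I actually ran):

```python
import pyhmmer
from pyhmmer.easel import Alphabet, SequenceFile
from pyhmmer.plan7 import HMMFile, OptimizedProfileBlock

# Pfam was pressed beforehand (e.g. with pyhmmer.hmmpress), so the
# .h3m/.h3i/.h3f/.h3p files sit next to "Pfam-A.hmm".
with HMMFile("Pfam-A.hmm") as hmm_file:
    targets = OptimizedProfileBlock(
        Alphabet.amino(), hmm_file.optimized_profiles()
    )

# Load all query proteins into memory as a single digital sequence block.
with SequenceFile("proteins.fasta", digital=True) as seq_file:
    sequences = seq_file.read_block()

# Scan every protein against the prefetched Pfam profiles.
all_hits = list(pyhmmer.hmmscan(sequences, targets, cpus=32))
```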

Thanks for any help!

althonos commented 1 year ago

Hi!

Indeed, you are right: in a case where you are not memory-limited, pre-fetching everything should give you better performance.

Do you have a small number of proteins to scan? If so, the overhead of setting up the scan may outweigh the time actually spent in parallel sections. There is also the fact that HMMs need to be reconfigured for every new query sequence, so to prevent concurrent accesses the OptimizedProfileBlock keeps one lock per HMM; in a case where you don't have a lot of sequences, I could see the threads starving each other.

Also, could you try with hmmsearch just as a test?
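
Something along these lines should do as a quick test (untested sketch, reusing the sequence block you already loaded; note that the query/target order is swapped compared to hmmscan):

```python
import pyhmmer
from pyhmmer.plan7 import HMMFile

# hmmsearch takes the HMMs as queries and the sequences as targets,
# so the argument order is the reverse of hmmscan.
with HMMFile("Pfam-A.hmm") as hmm_file:
    hmms = list(hmm_file)

all_hits = list(pyhmmer.hmmsearch(hmms, sequences, cpus=32))
```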

EvanKomp commented 1 year ago

@althonos

Thanks for the reply. I tried with hmmsearch and still got only 100% CPU usage, which makes me think it has something to do with the resources on my end. For further context, I have a node on a SLURM cluster with N CPUs, and I asked for 32. The CPU time is also linear with the run time, indicating that only a single CPU is being used. I followed the Python debugger far enough into the hmmsearch function to see that _cpus=32 in the local namespace.

For further context, I have millions of proteins, but have been running on chunks of about 50k proteins at a time to tune performance before I scale up closer to the memory limit (roughly as sketched below). It is taking about 20 minutes to run 50k. Has anyone else had problems on a cluster? I can try to run the code on my desktop to see if it will muster more usage.
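
This is roughly how the chunking looks (sketch; assuming the `sequences` argument of `read_block` is available in the installed pyhmmer version, and `targets` is the prefetched Pfam block from my first post):

```python
import pyhmmer
from pyhmmer.easel import SequenceFile

# Process the proteins in chunks of ~50k sequences instead of all at once.
with SequenceFile("proteins.fasta", digital=True) as seq_file:
    while True:
        chunk = seq_file.read_block(sequences=50000)
        if not chunk:
            break
        hits = list(pyhmmer.hmmscan(chunk, targets, cpus=32))
        # ... write out the hits for this chunk before moving to the next
```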

Evan

EvanKomp commented 1 year ago

Update: hmmsearch and hmmscan can both use all (8) of the cores on my desktop, reducing search time to 4 and 7 minutes per 50k chunk, respectively. Not sure why only one core is being used inside an HPC job. Any ideas? I will play around and report back if I find something.

EvanKomp commented 1 year ago

Update update: It was my SLURM configuration. I had ntasks=32.

Changing to ntasks=1 and cpus-per-task=32 allowed pyhmmer to see the processors.
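
For anyone hitting the same thing, the relevant part of my job script now looks like this (sketch; other directives omitted):

```
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
```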

Thanks for the time and the great software!

althonos commented 1 year ago

Cool, happy to hear this! :)

althonos commented 1 year ago

As a side note, if you're really trying to optimize for runtime, consider using hmmsearch instead of hmmscan; it will give you the same results as long as you properly set the Z and domZ parameters. hmmsearch has the advantage that each HMM is reconfigured only once, whereas in hmmscan each HMM is reconfigured once per query, which can quickly add up. It requires a bit more code afterwards to disentangle the results (see the sketch below), but it's worth it for performance :wink:
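
For illustration, an untested sketch of what I mean (`sequences` stands for your pre-loaded DigitalSequenceBlock, the Z value is just an example, and you would typically also filter each hit on its E-value):

```python
import collections
import pyhmmer
from pyhmmer.plan7 import HMMFile

Z = 19632  # e.g. the number of HMMs in the Pfam release you are using

with HMMFile("Pfam-A.hmm") as hmm_file:
    hmms = list(hmm_file)

# Run Pfam against the sequences with hmmsearch, then regroup the hits per
# protein so the results look like hmmscan output. hmmsearch yields one
# TopHits per query HMM, in the same order as `hmms`.
hits_per_protein = collections.defaultdict(list)
for hmm, top_hits in zip(hmms, pyhmmer.hmmsearch(hmms, sequences, cpus=32, Z=Z)):
    for hit in top_hits:
        # In hmmsearch the hits are sequences, so hit.name is the protein
        # name, while hmm.name is the Pfam family it matched.
        hits_per_protein[hit.name].append((hmm.name, hit))
```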

EvanKomp commented 1 year ago

@althonos

While I have you: if I understand correctly, Z is the number of subjects searched. I see how that would be different for search vs. scan, and how it could be manually set in a "smart" way. I am a bit more confused about domZ. If I am reading correctly, it is the number of subjects that pass the (E-value?) threshold for a single query. The HMMER manual is only so-so helpful.

Specifically, let's say I am running Pfam against dataset A of size A and dataset B of size B. Correct me if I am wrong:

  1. In the case of hmmscan, the output E-values will be normalized the same way, since the subjects are the HMMs in Pfam, so no manual processing is required.
  2. In the case of hmmsearch, Z and domZ are calculated with respect to A or B, so to compare E-values between the two runs, both parameters need to be set.

Thanks for any clarity, Evan

althonos commented 1 year ago

So, the E-value you get in HMMER is actually a P-value corrected for multiple testing. The Z parameter is the number of comparison targets, meaning it is the number of HMMs for hmmscan and the number of sequences for hmmsearch. This means that if you manually set Z=19632 (the number of HMMs in Pfam 35, if that's what you're using), you can use either hmmsearch or hmmscan and get the same hit E-values.

In theory, the same holds for the domain E-values with domZ; however, even I am not entirely sure how domZ is computed when it is not provided (unlike Z). That said, domZ only matters when computing domain E-values, so if you filter your hits based on the hit E-values this is not so relevant.
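
Concretely, something like this (sketch; `seqs_a` and `seqs_b` stand for your two sequence blocks, `hmms` is the Pfam HMM list from the earlier sketch, and 19632 is just the Pfam 35 example from above) would give hit E-values that are directly comparable between the two runs:

```python
import pyhmmer

Z = 19632  # number of HMMs in the Pfam release, as above

# Fixing Z means the hit P-values are corrected by the same factor in both
# runs, so the resulting E-values are comparable between datasets A and B.
hits_a = list(pyhmmer.hmmsearch(hmms, seqs_a, cpus=32, Z=Z))
hits_b = list(pyhmmer.hmmsearch(hmms, seqs_b, cpus=32, Z=Z))
```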