Merck / deepbgc

BGC Detection and Classification Using Deep Learning
https://doi.org/10.1093/nar/gkz654
MIT License

hmmscan runs on only one core #30

Closed by thinkgenome 4 years ago

thinkgenome commented 4 years ago

Can CPU utilisation during the hmmscan step be improved, to speed up the analysis of metagenome data?

prihoda commented 4 years ago

AFAIK hmmscan is able to use all cores by default:

       --cpu <n>
              Set the number of parallel worker threads to <n>. By default, HMMER sets this to
              the number of CPU cores it detects in your machine

I also ran a quick test, and adding --cpu 8 had no effect on performance.

Keep the issues coming, we are very happy to get feedback 👍

thinkgenome commented 4 years ago

Thanks for the prompt reply, @prihoda. What changes to the deepbgc pipeline command line do you suggest? With the default command line, hmmscan only uses one core at a time for each sequence.

prihoda commented 4 years ago

I am guessing this is due to limits in the hmmscan implementation; for me it also uses only around 1-2 cores.

As per the hmmscan docs, it looks like you can control the CPU limit using an env var:

You can also control this number by setting an environment variable, HMMER_NCPU.

So you can try setting that, but I think it unfortunately won't make a difference.
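For anyone who wants to try the environment variable anyway, here is a minimal sketch of one way to set it from Python. Child processes inherit the environment, so an hmmscan started by the wrapped command should see the value. The helper names `env_with_hmmer_threads` and `run_with_hmmer_threads` are made up for this example and are not part of DeepBGC or HMMER. As noted above, this may well not change throughput.

```python
import os
import subprocess


def env_with_hmmer_threads(n_cpu, base_env=None):
    """Return a copy of the environment with HMMER_NCPU set to n_cpu.

    Hypothetical helper for illustration; only the HMMER_NCPU variable
    itself comes from the hmmscan documentation quoted above.
    """
    if n_cpu < 1:
        raise ValueError("n_cpu must be at least 1")
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["HMMER_NCPU"] = str(n_cpu)
    return env


def run_with_hmmer_threads(argv, n_cpu):
    """Run argv as a subprocess with HMMER_NCPU set for it and its children."""
    return subprocess.run(argv, env=env_with_hmmer_threads(n_cpu), check=True)
```

Usage would look something like `run_with_hmmer_threads(["deepbgc", "pipeline", "genome.fa"], 8)`, where `genome.fa` is a placeholder input file.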

danudwary commented 4 years ago

Can hmmscan be replaced with hmmsearch here? This is the compute bottleneck, as far as I can tell, and prevents scaling up my deepBGC usage.

This is old, but I think it still applies to hmmer 3.3. https://cryptogenomicon.org/2011/05/27/hmmscan-vs-hmmsearch-speed-the-numerology/
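One thing a switch would have to handle: in the `--domtblout` table, hmmscan reports the profile HMM as the target and the protein as the query, while hmmsearch reports them the other way round. Below is a rough sketch of a parser that normalises both layouts. The function name and the returned keys are invented for illustration, and I am only assuming that the downstream code wants a (protein, Pfam) pair per domain hit. Also note that E-values depend on the size of the search space, so they may not match exactly between the two programs.

```python
def parse_domtblout_line(line, program):
    """Parse one non-comment domtblout row into a program-independent dict.

    Hypothetical helper: `program` is "hmmscan" or "hmmsearch" and decides
    which of the target/query columns holds the protein and which the HMM.
    """
    # 22 fixed whitespace-separated columns, then a free-text description.
    fields = line.split(maxsplit=22)
    if len(fields) < 22:
        raise ValueError("not a complete domtblout row")
    target_name, query_name = fields[0], fields[3]
    if program == "hmmscan":
        # hmmscan: each query sequence is searched against a database of HMMs.
        pfam_name, protein_id = target_name, query_name
    elif program == "hmmsearch":
        # hmmsearch: each query HMM is searched against a sequence database.
        protein_id, pfam_name = target_name, query_name
    else:
        raise ValueError("program must be 'hmmscan' or 'hmmsearch'")
    return {
        "protein_id": protein_id,
        "pfam_name": pfam_name,
        "i_evalue": float(fields[12]),      # independent E-value of the domain
        "domain_score": float(fields[13]),  # bit score of the domain
    }
```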