EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org

Resource contention from independently run, unrelated hmmscan processes #240

Closed: GabeAl closed this issue 2 years ago

GabeAl commented 3 years ago

Hi all, I'm noticing a performance problem with hmmscan run on some simple bacterial genomic protein sets. With one thread it does fine. But even when given independent processes on independent threads from independent shells... this happens:

[screenshot: htop resource utilization across 64 hmmscan processes]

I have 256 threads (128 cores), and am running only 64 hmmscan processes, each with --cpu 1. These are all different, separate, independent instances that should not be talking to each other or even be aware of each other's existence. The screenshot above is from htop, showing the resource utilization. When running multiples like this I see very low CPU use in the "user" category and very high use in the "system" (i.e. wasted) category. That's where the red in the htop view comes from -- wasted processing.

Some numbers:

  • On one genome, hmmscan takes about 4 minutes to run.
  • But scale up to 64 separate instances, and each takes 60 minutes!

That's just weird and seems like it shouldn't be happening. What's going on? Is hmmscan locking something on the system? If so, couldn't it copy what it needs to its own local thread and process it there? It seems one instance is somehow negatively affecting other instances.

Or does it already do this and we're just RAM-bandwidth starved in general here? (Like, does it take each query individually through the entire 1.5 GB database and then return for the next query?)

Maybe unrelated, but what is the I/O thread actually doing? I see each instance eating up more than 100% CPU, but what benefits does it actually bring? Why can't we just "fread" the file into RAM? Or mmap() it? Honestly these aren't big files -- Pfam-A is only 1.5 gigs, and each entire genome full of queries is a couple of MBs.

Would be eager to figure out what's happening here. (I would be even more eager to switch to hmmsearch, but from what I understand its e-values are not assigned per query sequence; they depend on the context of the whole genome's worth of sequences being searched. That's a huge no-no when trying to build a deterministic model around the presence/absence of certain annotations at a universal e-value cutoff: the same exact protein gets different e-values in different contexts, which makes it impossible to subselect genes beforehand, or even to directly compare annotations between a complete and an incomplete genome of the identical strain, or between related strains with differing pangenomic content. Maybe the easiest solution is to add the hmmscan table output to hmmsearch as an option, e.g. "--query-centric", and make hmmscan an alias for it.)

Thanks for any insights and/or discussion!

PS: for reference, here is htop with 128 instances of hmmsearch --cpu 1: [screenshot: htop resource utilization]. Obviously a lot nicer-looking and much less red, despite double the threads used!

npcarter commented 3 years ago

To make sure I understand correctly, you're running 64 hmmscan jobs where the HMM database (Pfam-A) is the same for all jobs, and the sequence file varies from run to run. Is that correct?

If that's the case, it's fairly likely that you're being I/O limited by some combination of the time required to read the HMM database from disk and the time required to parse it into the data structures we use internally. Hmmscan reads the HMM database from disk once for each sequence in the sequence file, so the total amount of data read from disk in a run is something like sequence file size + HMM file size * # of sequences in file. Also, converting a text-format HMM into our data structures takes a moderate amount of computation, making things worse. (Yes, this method is inefficient in the case where a run searches many sequences, but it lets us minimize memory use in the common case where each run targets a single sequence.)
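
To put rough numbers on it (assuming ~4,000 proteins in a typical bacterial genome, and the ~1.5 GB Pfam-A you mentioned): that's on the order of 4,000 × 1.5 GB ≈ 6 TB of HMM data read and re-parsed per genome, which would leave the runs I/O- and parse-bound rather than compute-bound.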

For something like what you're doing, you should see better performance if you turn things around and use hmmsearch, like you mentioned. Hmmsearch reads the target sequence database once for each HMM in the query HMM file, but an HMM takes about 200x the disk space of a sequence of the same length and more computation per element to convert into internal data structures, so repeatedly reading the sequence database is the much cheaper direction. As you mention, the challenge here is making the program calculate the correct e-values. You can do that with the -Z flag to hmmsearch, which sets the number of comparisons used when calculating e-values.

For the hmmscans you're running, the number of comparisons = the number of HMMs in the Pfam-A database, and each sequence is treated as an independent search. So, if you switch to hmmsearch and use -Z to set the number of comparisons to that value, you'll get the same e-values as you do with hmmscan.
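
As a rough sketch (file names here are placeholders, and you'd substitute the actual number of profiles in your copy of Pfam-A, e.g. from grep -c '^NAME' on the ASCII Pfam-A.hmm):

    hmmsearch --cpu 1 -Z 19179 --tblout genome1.tbl Pfam-A.hmm genome1.faa

This searches every profile against genome1's proteins, but reports e-values as if each sequence had been hmmscan'd against the full database.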

Given all that, you can probably minimize the bandwidth bottleneck by merging all of your sequences into one sequence database, splitting your HMM database into 64 pieces, and doing hmmsearch searches of each piece of your HMM database against your complete sequence database.
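
Concretely, a sketch of that layout (names are placeholders; the split itself can be produced however you like):

    # all_genomes.faa = every query protein concatenated together
    # Pfam-A.part00.hmm .. Pfam-A.part63.hmm = the 64 pieces of Pfam-A
    # keep -Z at the full Pfam-A profile count (19179 is a stand-in) so e-values still match hmmscan
    for i in $(seq -w 0 63); do
        hmmsearch --cpu 1 -Z 19179 --tblout part$i.tbl Pfam-A.part$i.hmm all_genomes.faa &
    done
    wait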

Hope that helps!

-Nick

GabeAl commented 3 years ago

Thanks, your understanding is correct!

I should mention that in the hmmscan case, I'm using the binary version of the hmm file (converted via hmmpress) and storing it on ramdisk so there is no disk I/O involved, nor the overhead of parsing text.

That said, your comment about the -Z flag is very much appreciated. Pfam-A supposedly contains 19,179 entries (v34.0) according to the web site (http://pfam.xfam.org/). I'll try using that number (unless their definition of an "entry" differs from the definition of an "HMM" here, i.e. it's not a 1:1 mapping).

I am currently running hmmsearch as-is (no splitting, no -Z) and am seeing a speed of roughly 2,000 genomes annotated per hour (roughly 1 genome fully annotated every 2 seconds). I would be quite eager to improve this to 1 genome per second, since I have over a million prokaryotic genomes to get through. In your experience, should I recompile to disable the I/O thread (i.e. disable all threading) and run one instance on every hyperthread, or keep the default I/O thread as-is and run half as many instances (i.e. one per physical core)?

I'll also try the splitting trick, but have a question there: How can I split a pre-existing hmm database? Is this in the manual? I assume once I do I'll need to set Z to whatever number of HMMs each one contains. Then merging the results is as simple as concatenating the --tblout files (after removing the comments).
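
For that merge, I'm picturing a one-liner along these lines (just a sketch, relying on the fact that --tblout comment lines start with '#'):

    grep -vh '^#' part_*.tbl > combined.tbl

where -h suppresses the file-name prefixes.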

Thanks, Gabe

npcarter commented 3 years ago

I don't have any experience disabling the I/O thread, so can't comment much there.

For splitting HMM files, the hmmfetch utility may help. If you can construct text files containing the names of the HMMs you want in each split, you can then pass each list to hmmfetch (via its -f option) to extract those HMMs into a separate file.
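
As a rough sketch (file and list names here are placeholders), assuming you already have 64 name-list files chunk_aa, chunk_ab, ...:

    hmmfetch --index Pfam-A.hmm      # build an SSI index for retrieval, if not already present
    for f in chunk_*; do
        hmmfetch -f Pfam-A.hmm "$f" > Pfam-A.$f.hmm
    done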

-Nick

GabeAl commented 3 years ago

Excellent, thanks. Yes, the Pfam docs list a name for each entry. The manual (section 8, page 107) also details the ASCII format of the existing HMM files, which can be parsed for the names to feed to hmmfetch (or, if you're particularly bold, you could split the ASCII HMM file directly yourself!).
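
For anyone following along, building those name lists looks like a one-liner (a sketch, assuming GNU coreutils; each model in the HMMER3 ASCII format begins with a NAME line):

    grep '^NAME' Pfam-A.hmm | awk '{print $2}' | split -n l/64 - chunk_

which writes 64 name-list files (chunk_aa, chunk_ab, ...) ready to hand to hmmfetch -f.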