althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

Recommended settings for searching a small number of profiles against a large database #16

Closed seanrjohnson closed 1 year ago

seanrjohnson commented 2 years ago

I'm using pyhmmer to search a small number of profiles (for example custom profiles, not from Pfam) against large databases, such as UniProt. Here are a few things I have questions about:

  1. Because the databases are too large to fit into memory, I am chunking the sequences into groups of 10,000 proteins at a time and making multiple calls to hmmsearch. To get the E-value calculations right, I have to parse the protein file twice: first to count the number of sequences and get the proper Z value, then a second time to pass the sequences in batches to hmmsearch along with the pre-calculated Z, since otherwise each batch would be searched with Z=10,000. Is there a way to hold off calculating E-values until after the last batch of sequences has been passed, so that I only have to make one pass through the input (this seems like it's probably not possible)? A rough sketch of my current two-pass setup is below this list.
  2. As mentioned in the documentation, the parallelization strategy is not optimal when the number of profiles is less than the number of threads. When I run with a single hmm, I see usage of a single core. Are there any plans to implement a parallelization strategy that would be more appropriate for these kinds of workflows? If you don't have time but can point me in the right direction, I might be able to submit a pull request for this.
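
For context, here is roughly what my current two-pass setup looks like (a minimal sketch: the file paths and chunk size are placeholders, and I'm assuming that passing the pre-computed value through the `Z` keyword of `pyhmmer.plan7.Pipeline` is the right way to do it):

import itertools
import pyhmmer

PROTEIN_FILE = "proteins.fasta"  # placeholder path
CHUNK_SIZE = 10000

# load the query profile
with pyhmmer.plan7.HMMFile("query.hmm") as hmm_file:
    hmm = hmm_file.read()

# first pass: count the sequences to get the database size Z
with pyhmmer.easel.SequenceFile(PROTEIN_FILE) as seqs_file:
    Z = sum(1 for _ in seqs_file)

# second pass: search chunk by chunk, forcing the pre-computed Z
all_hits = []
with pyhmmer.easel.SequenceFile(PROTEIN_FILE) as seqs_file:
    seqs_file.set_digital(hmm.alphabet)
    while True:
        chunk = list(itertools.islice(seqs_file, CHUNK_SIZE))
        if not chunk:
            break
        # a fresh pipeline per chunk, with Z fixed so E-values reflect the full database size
        pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet, Z=Z)
        all_hits.append(pipeline.search_hmm(hmm, chunk))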

I like using pyhmmer instead of the hmmsearch executable for this, because my input files are actually genbank genome files with CDS annotations, so with pyhmmer I can read the files, translate the CDSs and run hmmsearch all in memory without ever writing the protein sequences to disk, which is great.

althonos commented 2 years ago

Hi @seanrjohnson !

Yes, unfortunately at the moment I haven't gotten around to actually working on balancing the workload for (few profiles) vs. (many sequences); this is planned for when I can work on PyHMMER in more detail.

For the E-value calculation, the HMMER code has support for merging TopHits obtained from different runs. I removed it because it was causing memory safety issues, but I will try to add it back so that you can get your results with a single pass.

seanrjohnson commented 2 years ago

Thinking about it some more, the TopHits approach wouldn't actually help my particular case of really large sequence databases, because I'd then have to keep my entire input file in memory and/or store large TopHits objects in memory. I think what I'll end up doing is caching the Z value in an auxiliary text file, so the first pass to calculate Z only has to be run on the first query against the database.
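
Something like this minimal sketch (the sidecar file name is just an example):

import os
import pyhmmer

def database_size(path):
    # return the number of sequences in `path`, caching the count in a sidecar file
    cache = path + ".Z"
    if os.path.exists(cache):
        with open(cache) as f:
            return int(f.read().strip())
    with pyhmmer.easel.SequenceFile(path) as seqs_file:
        z = sum(1 for _ in seqs_file)
    with open(cache, "w") as f:
        f.write(str(z))
    return z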

The few-profiles vs. many-sequences optimization would be great, though; I'm looking forward to it whenever you have time to implement it.

The current version of pyhmmer is actually working pretty well for me; it just could be a bit faster.

althonos commented 2 years ago

Thinking about it some more, the TopHits approach wouldn't actually help my particular case of really large sequence databases, because I'd then have to keep my entire input file in memory and/or store large TopHits objects in memory.

No, actually, it would, because the E-values are computed on demand, not during a run. Basically, you would run a search with the same profile on different chunks (N=10,000, for instance) and get a TopHits for each. Then you can merge these TopHits together, which will cause the Z values to be summed as well: you get one large TopHits with all your domains that now has the correct Z and consequently the correct E-values for all hits.

seanrjohnson commented 2 years ago

Ah, I see, yes, that could be helpful.

althonos commented 2 years ago

Okay, I've brought back the hits merging, this time with safer semantics (copying data to avoid use-after-free and that kind of thing). This means you should be able to get some multithreading and avoid parsing the input twice. If you want to try it, install from the develop branch:

$ pip install -U https://github.com/althonos/pyhmmer/archive/refs/heads/develop.zip

Then, here's a small snippet on how to use it to process several chunks in parallel (of course, adapt it to your example). It's sub-optimal (a proper implementation would recycle the pipeline / do clean load balancing / cache the optimized HMM between runs), but that's a basis for what I'll probably end up doing eventually:

import multiprocessing.pool
import itertools
import pyhmmer

# a helper to group an iterator into blocks of size `block_length`
def blocks(iterator, block_length=1000):
    while True:
        block = list(itertools.islice(iterator, block_length))
        if not block: 
            break
        yield block

# load the profile 
with pyhmmer.plan7.HMMFile("tests/data/hmms/txt/PF02826.hmm") as hmm_file:
    hmm = hmm_file.read()

# iterator over sequences
with pyhmmer.easel.SequenceFile("tests/data/seqs/938293.PRJEB85.HG003687.faa") as seqs_file:
    seqs_file.set_digital(hmm.alphabet)

    # the callable to execute in the pool
    def process(chunk):
        pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
        return pipeline.search_hmm(hmm, chunk)

    # use a threadpool to run several searches in parallel
    with multiprocessing.pool.ThreadPool() as pool:   
        all_hits = pool.map(process, blocks(seqs_file))
        merged_hits = pyhmmer.plan7.TopHits().merge(*all_hits)

print("Found", len(merged_hits), "hits") # hits in merged_hits should have the correct e-value!

Of course, you'll have to adapt it to fit your needs, but as long as you manage to pass an iterable to pool.map, you'll be able to process several chunks in parallel without having to load the entirety of your sequence database into memory. Please let me know if this works well enough and I'll wrap it with some other changes into a new release.

seanrjohnson commented 2 years ago

Thanks. I'll try it out and get back to you again within a week.

I did some profiling today and found out that the bottleneck in my program is actually the Biopython genbank parser, which is far slower than the actual hmmsearch with pyhmmer! So, I guess I need to figure out how to parallelize the parsing steps too (or find or write a more performant genbank parser...).

althonos commented 2 years ago

Ah, true, the GenBank parser in Biopython is painfully slow. I've been using Rust with gb-io to process large amounts of data (it cut the runtime from 1 hour to 8 minutes on my 50GB of sequences), so one day I may end up writing a Python wrapper for that! Maybe you can have a look at scikit-bio? I have no clue about its performance, but it can't be much worse than Bio.SeqIO...

seanrjohnson commented 2 years ago

I'll check out scikit-bio. I'd like to do more development in Go (maybe using the poly library for genbank parsing: https://github.com/TimothyStiles/poly), but there are just so many cool libraries (like pyhmmer) that only work with Python.

seanrjohnson commented 2 years ago

Just to update you. The scikit-bio genbank parser is actually about twice as slow as the Biopython parser!

On my machine and dataset, parsing a genbank file with Biopython takes about the same amount of time as searching 25 hmm profiles with pyhmmer (or the hmmsearch executable) on 1 core. Using Biopython to translate CDSs is actually slower than parsing genbank files, so including pre-computed translations in the genbank file makes data preparation a little more than twice as fast (already accounted for in the numbers above).

It is possible to parallelize reading genbank files, for instance by splitting the input file on // lines and passing those chunks to different subprocesses, then using StringIO to get them into the BioPython parser (I also tried passing file offsets and then re-reading the data from the file in each process, which worked slightly faster).
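
For example, something along these lines (a rough sketch; one record per chunk and the input path are just illustrative):

import io
import multiprocessing
from Bio import SeqIO

def genbank_chunks(path):
    # yield the raw text of one GenBank record at a time, splitting on the // terminator lines
    buffer = []
    with open(path) as handle:
        for line in handle:
            buffer.append(line)
            if line.rstrip() == "//":
                yield "".join(buffer)
                buffer = []

def parse_chunk(text):
    # parse the raw text of one chunk back into SeqRecords
    return list(SeqIO.parse(io.StringIO(text), "genbank"))

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        for records in pool.imap(parse_chunk, genbank_chunks("contigs.gbk")):
            ...  # do something with the SeqRecords from this chunk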

The problem is that passing SeqRecords between processes is slow. In fact, it seems to be just as slow to read a SeqRecord from a Queue as to read it from a file. So I can't parse all the SeqRecords in separate processes, then collect them and make one call to hmmsearch; instead, I'd have to call hmmsearch from each parser process and collect the output (which, for various reasons, I don't want to do).

So, basically, I haven't tried your new code yet because I don't yet actually have a scenario where it would be helpful. But I might get there eventually by playing with different parallel parsing schemes or indexing schemes.

althonos commented 2 years ago

Just to update you. The scikit-bio genbank parser is actually about twice as slow as the Biopython parser!

Oh crap, sorry to hear that. I really need to start looking into binding the Rust parser then!

The problem is that passing SeqRecords between processes is slow.

I'd assume it is: Python pickles everything, and unless you have a particularly good implementation there is a lot of copying going on.


Have you considered pre-translating and splitting your input sequences? I guess it's going to take some disk space, so I don't know how feasible it is on your side, but if you do a first pass of extracting the CDSs from the GenBank files into several FASTA files (e.g. one FASTA file for every 10,000 sequences or so), then afterwards you could use the pyhmmer.easel.SequenceFile API to load the chunks (and this should be much faster). Then you could load and process each file independently using a ThreadPool, and merge the results at the end.
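
Roughly like this, reusing the APIs from the snippet above (a sketch; the profile path and the chunk file pattern are placeholders):

import glob
import multiprocessing.pool
import pyhmmer

# load the profile
with pyhmmer.plan7.HMMFile("query.hmm") as hmm_file:
    hmm = hmm_file.read()

def search_chunk(path):
    # load one pre-translated FASTA chunk and search it with the profile
    with pyhmmer.easel.SequenceFile(path) as seqs_file:
        seqs_file.set_digital(hmm.alphabet)
        seqs = list(seqs_file)
    pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
    return pipeline.search_hmm(hmm, seqs)

# one FASTA file per chunk of ~10,000 translated CDSs
chunk_files = sorted(glob.glob("chunks/*.faa"))

# search the chunks in parallel and merge the results at the end
with multiprocessing.pool.ThreadPool() as pool:
    all_hits = pool.map(search_chunk, chunk_files)
merged_hits = pyhmmer.plan7.TopHits().merge(*all_hits)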

seanrjohnson commented 2 years ago

Yes, I think pre-translating the input sequences is probably the fastest way to go for many scenarios.

The complication for my program is that the output is actually also a genbank file, but with the hmmer hits mapped as features. Basically, you give it an hmm profile, it searches some large set of metagenome contigs, and it writes a genbank file showing where on the contigs your hmm profile(s) align.

I'm thinking of making peptide fasta files where the sequence names contain the file offsets in the genbank file. So when there is a hit, the program can just seek to that particular record, read it, add the hmmer annotation, and write it. I think something like that would work well for cases where most metagenome contigs don't have any hits to the hmm profile. For cases where most of them do, one way or another I'll have to parse them so I can add the new hmmer-derived annotations.
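
Something like this rough sketch of the offset idea (the header format and file names are just placeholders):

import io
from Bio import SeqIO

GENBANK = "contigs.gbk"    # placeholder input
PEPTIDES = "peptides.faa"  # placeholder output

# first pass: for each GenBank record, remember its file offset and write the
# pre-translated CDSs to FASTA with the offset embedded in the sequence name
with open(GENBANK) as gbk, open(PEPTIDES, "w") as faa:
    offset = gbk.tell()
    lines = []
    for line in iter(gbk.readline, ""):   # readline keeps tell() usable
        lines.append(line)
        if line.rstrip() == "//":
            record = SeqIO.read(io.StringIO("".join(lines)), "genbank")
            cdss = [f for f in record.features if f.type == "CDS"]
            for i, feat in enumerate(cdss):
                translation = feat.qualifiers.get("translation", [""])[0]
                if translation:
                    faa.write(f">{record.id}|{offset}|{i}\n{translation}\n")
            lines = []
            offset = gbk.tell()

# later, when a hit names e.g. "contig42|123456|7", re-read only that record
def load_record(path, offset):
    with open(path) as gbk:
        gbk.seek(offset)
        text = []
        for line in iter(gbk.readline, ""):
            text.append(line)
            if line.rstrip() == "//":
                break
        return SeqIO.read(io.StringIO("".join(text)), "genbank")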

alienzj commented 2 years ago

Okay, I've brought back the hits merging, this time with safer semantics (copying data to avoid use-after-free and that kind of thing). This means you should be able to get some multithreading and avoid parsing the input twice. […]

Dear @althonos

What's the difference between this part and pyhmmer.hmmsearch(hmm, pep, cpus=128)? And how about pyhmmer.plan7.LongTargetsPipeline?

Sorry for asking so many questions; I am also very interested in pyhmmer's performance.

althonos commented 2 years ago

Hi @alienzj ,

The difference between this and a simple run of hmmsearch is that hmmsearch requires your entire sequence database to be loaded in memory, which was not possible here. The code I wrote will produce the same results, but by running on chunks of the target sequence database, and merging results at the very end.

LongTargetsPipeline is not about performance; it's the pipeline used for DNA searches in the nhmmer binary. As its name implies, it supports long targets by matching domains over a sliding window instead of processing the whole sequence at once (which HMMER only supports up to 100,000 residues, and a genome almost always exceeds that).
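
For completeness, a minimal sketch of using it (assuming a nucleotide profile and DNA targets; the file paths are placeholders):

import pyhmmer

# a nucleotide profile, e.g. built from a DNA alignment
with pyhmmer.plan7.HMMFile("dna_query.hmm") as hmm_file:
    hmm = hmm_file.read()

with pyhmmer.easel.SequenceFile("genome.fna") as seqs_file:
    seqs_file.set_digital(hmm.alphabet)
    # the long-targets pipeline windows over the targets instead of requiring whole sequences
    pipeline = pyhmmer.plan7.LongTargetsPipeline(hmm.alphabet)
    hits = pipeline.search_hmm(hmm, list(seqs_file))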

alienzj commented 2 years ago

@althonos I see; for large sequence databases, I will try the method above. Thanks!

althonos commented 1 year ago

Release v0.7.0 now has an hmmscan implementation, which is suited for doing what @seanrjohnson originally asked, so I'll be closing this issue.
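
For reference, a minimal sketch of what that looks like (assuming the v0.7.0 API; the file paths, alphabet, and cpus value are placeholders):

import pyhmmer

# load the small set of custom profiles once
with pyhmmer.plan7.HMMFile("custom_profiles.hmm") as hmm_file:
    hmms = list(hmm_file)

alphabet = pyhmmer.easel.Alphabet.amino()

# stream the protein sequences and scan each one against the profiles;
# hmmscan parallelizes over query sequences, which suits few profiles vs. many sequences
with pyhmmer.easel.SequenceFile("proteins.faa", digital=True, alphabet=alphabet) as seqs_file:
    for hits in pyhmmer.hmmscan(seqs_file, hmms, cpus=4):
        for hit in hits:
            print(hit.name.decode(), hit.evalue)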