I'm using pyhmmer to search a small number of profiles (for example custom profiles, not from Pfam) against large databases, such as UniProt. Here are a few things I have questions about:
I like using pyhmmer instead of the hmmsearch executable for this, because my input files are actually genbank genome files with CDS annotations, so with pyhmmer I can read the files, translate the CDSs and run hmmsearch all in memory without ever writing the protein sequences to disk, which is great.
Hi @seanrjohnson !
Yes, unfortunately at the moment I haven't gotten around to actually working on balancing the workload for (few profiles) vs. (many sequences); this is planned for when I can work on PyHMMER in more detail.
For the E-value calculation, the HMMER code has support for merging `TopHits` obtained from different runs, which is typically useful when your predictions come from several separate searches. I removed it because it was causing memory-safety issues, but I will try to add it back so that you can get your results in a single pass.
Thinking about it some more, the `TopHits` thing wouldn't actually help my particular case of having really large sequence databases, because then I'd have to keep my entire input file in memory and/or store large `TopHits` objects in memory. I think what I'll end up doing is caching the Z value in an auxiliary text file, so the first pass to calculate Z only has to be run on the first query against the database.
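For example, something like this rough sketch is what I have in mind (the cache file name, the helper, and `some_sequences` are made up for illustration, and `hmm` is assumed to be loaded as elsewhere in this thread):

```python
import os
import pyhmmer

Z_CACHE = "uniprot.z.txt"  # hypothetical cache file stored next to the database

def get_database_Z(fasta_path):
    # reuse a previously computed Z if the cache already exists
    if os.path.exists(Z_CACHE):
        with open(Z_CACHE) as cache:
            return float(cache.read())
    # otherwise, make one pass over the database just to count the target sequences
    with pyhmmer.easel.SequenceFile(fasta_path) as seqs_file:
        z = sum(1 for _ in seqs_file)
    with open(Z_CACHE, "w") as cache:
        cache.write(str(z))
    return float(z)

# later queries reuse the cached value instead of re-reading the whole database
pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet, Z=get_database_Z("uniprot.fasta"))
hits = pipeline.search_hmm(hmm, some_sequences)  # E-values are computed with the cached Z
```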
The few-profiles vs many-sequences optimization would be great though, I'm looking forward to that, whenever you have time to implement it.
The current version of pyhmmer is actually working pretty well for me, just could be a bit faster.
> Thinking about it some more, the `TopHits` thing wouldn't actually help my particular case of having really large sequence databases, because then I'd have to keep my entire input file in memory and/or store large `TopHits` objects in memory.
No, actually, it would, because the E-values are computed on demand, not during a run. Basically, you would run a search for the same domain on different chunks (N=10,000 sequences, for instance) and get a `TopHits` for each. Then you can merge these `TopHits` together, which will cause the Z values to be summed as well: you get one large `TopHits` with all your domains that now has the correct Z, and consequently the correct E-values for all hits.
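As a minimal illustration (here `chunk1` and `chunk2` stand for two hypothetical lists of digitized target sequences, and `hmm` for the query profile):

```python
import pyhmmer

pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
hits1 = pipeline.search_hmm(hmm, chunk1)  # Z only counts the sequences in chunk1
hits2 = pipeline.search_hmm(hmm, chunk2)  # Z only counts the sequences in chunk2

# merging sums the Z values, so the E-values become relative to the whole database
merged = pyhmmer.plan7.TopHits().merge(hits1, hits2)
```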
Ah, I see, yes, that could be helpful.
Okay, I've brought back the hits merging, this time with safer semantics (copying data to avoid use-after-free and that kind of thing). This means you should be able to get some multithreading and avoid parsing twice. If you want to try it, install from the `develop` branch:
```console
$ pip install -U https://github.com/althonos/pyhmmer/archive/refs/heads/develop.zip
```
Then, here's a small snippet showing how to use it to process several chunks in parallel (of course, adapt it to your use case). It's sub-optimal (a proper implementation would recycle the pipeline, do clean load balancing, and cache the optimized HMM between runs), but it's a basis for what I'll probably end up doing eventually:
```python
import multiprocessing.pool
import itertools

import pyhmmer

# a helper to group an iterator into blocks of size `block_length`
def blocks(iterator, block_length=1000):
    while True:
        block = list(itertools.islice(iterator, block_length))
        if not block:
            break
        yield block

# load the profile
with pyhmmer.plan7.HMMFile("tests/data/hmms/txt/PF02826.hmm") as hmm_file:
    hmm = hmm_file.read()

# iterator over sequences
with pyhmmer.easel.SequenceFile("tests/data/seqs/938293.PRJEB85.HG003687.faa") as seqs_file:
    seqs_file.set_digital(hmm.alphabet)

    # the callable to execute in the pool
    def process(chunk):
        pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
        return pipeline.search_hmm(hmm, chunk)

    # use a threadpool to run several searches in parallel
    with multiprocessing.pool.ThreadPool() as pool:
        all_hits = pool.map(process, blocks(seqs_file))

merged_hits = pyhmmer.plan7.TopHits().merge(*all_hits)
print("Found", len(merged_hits), "hits")  # hits in merged_hits should have the correct e-value!
```
Of course, you'll have to adapt it to fit your needs, but as long as you manage to pass an iterable to `pool.map`, you'll be able to process several chunks in parallel without having to load the entirety of your sequence database into memory. Please let me know if this works well enough, and I'll wrap it up with some other changes into a new release.
Thanks. I'll try it out and get back to you again within a week.
I did some profiling today and found out that the bottleneck in my program is actually the Biopython genbank parser, which is far slower than the actual hmmsearch with pyhmmer! So, I guess I need to figure out how to parallelize the parsing steps too (or find or write a more performant genbank parser...).
Ah, true, the GenBank parser in Biopython is painfully slow. I've been using Rust with `gb-io` to process large amounts of data (it cut the runtime from 1 hour to 8 minutes on my 50GB of sequences), so one day I may end up writing a Python wrapper for that! Maybe you can have a look at scikit-bio? I have no clue about its performance, but it can't be much worse than `Bio.SeqIO`...
I'll check out scikit-bio. I'd like to do more development in Go (maybe using the poly library for genbank parsing: https://github.com/TimothyStiles/poly), but there are just so many cool libraries (like pyhmmer) that only work with Python.
Just to update you. The scikit-bio genbank parser is actually about twice as slow as the Biopython parser!
On my machine and dataset, parsing a genbank file with Biopython takes about the same amount of time as searching 25 HMM profiles with pyhmmer (or the hmmsearch executable) on 1 core. Using Biopython to translate CDSs is actually slower than parsing genbank files, so including the translations in the genbank file cuts the data preparation time by a little more than half (already accounted for in the numbers above).
It is possible to parallelize reading genbank files, for instance by splitting the input file on `//` lines and passing those chunks to different subprocesses, then using StringIO to get them into the Biopython parser (I also tried passing file offsets and re-reading the data from the file in each process, which was slightly faster).
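For reference, here's roughly what I tried (a sketch; the input path is made up, and note that each worker returns parsed SeqRecords to the parent process, which is exactly the slow part described below):

```python
import io
from multiprocessing import Pool

from Bio import SeqIO

def genbank_chunks(path, records_per_chunk=100):
    # yield blocks of raw GenBank text, split on the `//` record terminator
    lines, count = [], 0
    with open(path) as handle:
        for line in handle:
            lines.append(line)
            if line.startswith("//"):
                count += 1
                if count == records_per_chunk:
                    yield "".join(lines)
                    lines, count = [], 0
    if lines:
        yield "".join(lines)

def parse_chunk(text):
    # each worker parses its own chunk of text with the Biopython parser
    return list(SeqIO.parse(io.StringIO(text), "genbank"))

if __name__ == "__main__":
    with Pool() as pool:
        for records in pool.imap(parse_chunk, genbank_chunks("contigs.gbk")):
            ...  # sending the SeqRecords back to the parent is the expensive step
```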
The problem is that passing SeqRecords between processes is slow. In fact, it seems to be just as slow to read a SeqRecord from a Queue as to read it from a file. So I can't parse all the SeqRecords in different worker processes, then collect them and make one call to hmmsearch; instead, I'd have to call hmmsearch from each parser process and collect the output (which, for various reasons, I don't want to do).
So, basically, I haven't tried your new code yet because I don't actually have a scenario yet where it would be helpful. But I might get there eventually by experimenting with different parallel parsing or indexing schemes.
> Just to update you. The scikit-bio genbank parser is actually about twice as slow as the Biopython parser!
Oh crap, sorry to hear that. I really need to start looking into binding the Rust parser then!
> The problem is that passing SeqRecords between processes is slow.
I'd assume it is; Python pickles everything, and unless you have a particularly good implementation there is a lot of copying going on.
Have you considered pre-translating and splitting your input sequences? I guess it's going to take some disk space, so I don't know how feasible it is on your side, but if you do a first pass extracting the CDSs from the GenBank files into several FASTA files (e.g. one FASTA file for every 10,000 sequences or so), then afterwards you could use the `pyhmmer.easel.SequenceFile` API to load the chunks (and this should be much faster). Then you could load and process each file independently using a `ThreadPool`, and merge the results at the end.
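Something along these lines is what I mean (just a sketch; the paths, chunk size, and header scheme are arbitrary, and it assumes your CDS features carry `/translation` qualifiers):

```python
from Bio import SeqIO

CHUNK_SIZE = 10_000  # proteins per FASTA chunk

def write_protein_chunks(genbank_path, prefix="proteins"):
    chunk_index, written, out = 0, 0, None
    for record in SeqIO.parse(genbank_path, "genbank"):
        for feature in record.features:
            if feature.type != "CDS" or "translation" not in feature.qualifiers:
                continue
            # start a new FASTA chunk whenever the current one is full
            if out is None or written == CHUNK_SIZE:
                if out is not None:
                    out.close()
                out = open(f"{prefix}.{chunk_index}.faa", "w")
                chunk_index += 1
                written = 0
            name = feature.qualifiers.get("protein_id", [f"{record.id}_{written}"])[0]
            out.write(f">{name}\n{feature.qualifiers['translation'][0]}\n")
            written += 1
    if out is not None:
        out.close()
```

Each resulting `.faa` file can then be opened with `pyhmmer.easel.SequenceFile` and searched independently, like the chunks in the snippet above.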
Yes, I think pre-translating the input sequences is probably the fastest way to go for many scenarios.
The complication for my program is that the output is actually also a genbank file, but with the hmmer hits mapped as features. Basically, you give it an hmm profile, it searches some large set of metagenome contigs, and it writes a genbank file showing where on each contig your hmm profile(s) align.
I'm thinking of making peptide fasta files where the sequence names contain the file offsets into the genbank file. So when there is a hit, the program can just seek to that particular record, read it, add the hmmer annotation, and write it out. I think something like that would work well for cases where most metagenome contigs don't have any hits to the hmm profile. For cases where most of them do, I'll have to parse them all one way or another so I can add the new hmmer-derived annotations.
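As a sketch of what I mean (the `offset=` header convention is something I'd invent for this, not anything standard):

```python
from Bio import SeqIO

def record_for_hit(genbank_path, hit_name):
    # headers in the peptide fasta would look like ">prot_123|offset=456789",
    # where the offset points at the LOCUS line of the source genbank record
    offset = int(hit_name.rsplit("offset=", 1)[1])
    with open(genbank_path) as handle:
        handle.seek(offset)                          # jump straight to the record
        return next(SeqIO.parse(handle, "genbank"))  # parse only that one record
```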
Dear @althonos,
What's the difference between the snippet you posted above and `pyhmmer.hmmsearch(hmm, pep, cpu=128)`?
And how about `pyhmmer.LongTargetsPipeline`?
Sorry for asking so many questions; I am also very interested in pyhmmer's performance.
Hi @alienzj,
The difference between this and a simple run of `hmmsearch` is that `hmmsearch` requires your entire sequence database to be loaded in memory, which was not possible here. The code I wrote produces the same results, but it runs on chunks of the target sequence database and merges the results at the very end.
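For comparison, this is roughly what the single-call approach looks like (a sketch with placeholder paths; note how the entire target file has to be loaded at once):

```python
import pyhmmer

with pyhmmer.plan7.HMMFile("profiles.hmm") as hmm_file:
    hmms = list(hmm_file)

with pyhmmer.easel.SequenceFile("targets.faa") as seqs_file:
    seqs_file.set_digital(hmms[0].alphabet)
    targets = list(seqs_file)  # <- the whole database ends up in memory here

# hmmsearch yields one TopHits per query, already with database-wide E-values
for hmm, top_hits in zip(hmms, pyhmmer.hmmsearch(hmms, targets, cpus=4)):
    print(hmm.name.decode(), len(top_hits), "hits")
```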
`LongTargetsPipeline` is not about performance; it's the pipeline used for DNA matches in the `nhmmer` binary. As its name implies, it supports long targets by matching domains within a sliding window instead of processing the whole sequence at once (which HMMER only supports up to 100,000 residues, and a genome almost always exceeds that).
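If you ever need it, the usage looks roughly like this (a sketch with placeholder paths, assuming a nucleotide profile and DNA targets):

```python
import pyhmmer

with pyhmmer.plan7.HMMFile("dna_profile.hmm") as hmm_file:
    hmm = hmm_file.read()

with pyhmmer.easel.SequenceFile("contigs.fna") as seqs_file:
    seqs_file.set_digital(hmm.alphabet)
    pipeline = pyhmmer.plan7.LongTargetsPipeline(hmm.alphabet)
    hits = pipeline.search_hmm(hmm, list(seqs_file))  # scans windows over long contigs
```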
@althonos I see, for large sequences I will try the above method. Thanks!
Release v0.7.0 now has an `hmmscan` implementation, which is suited for doing what @seanrjohnson originally asked, so I'll be closing this issue.
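For reference, a rough sketch of what that looks like with the v0.7.0 API (placeholder paths; for `hmmscan` the queries are the sequences and the targets are the profiles):

```python
import pyhmmer

# load the profile database
with pyhmmer.plan7.HMMFile("Pfam-A.hmm") as hmm_file:
    hmms = list(hmm_file)

# load the query proteins in digital mode
with pyhmmer.easel.SequenceFile("proteins.faa", digital=True) as seqs_file:
    proteins = seqs_file.read_block()

# hmmscan yields one TopHits per query sequence
for top_hits in pyhmmer.hmmscan(proteins, hmms, cpus=4):
    print(len(top_hits), "profiles matched")
```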