althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
125 stars 12 forks source link

how to search HMM file with multiple profiles, ie Pfam? #23

Closed nextgenusfs closed 2 years ago

nextgenusfs commented 2 years ago

I'm probably missing something obvious, but I can't seem to figure out how to use pyhmmer to search against a database of HMM profiles, it Pfam. When I try to read the file, it does not seem to iterate over the individual profiles.

Loading sequences you can do this, which create a list of the digitized sequences in

 with pyhmmer.easel.SequenceFile('inputfile.fasta', digital=True, alphabet=alphabet) as seq_file:
    sequences = list(seq_file)

I naively assumed that the HMMFile would do the same?

with pyhmmer.plan7.HMMFile('/path/to/Pfam-A.hmm') as hmm_file:
    hmm = list(hmm_file.read())

This errors out with TypeError: 'pyhmmer.plan7.HMM' object is not an iterator.

If I just run this:

with pyhmmer.plan7.HMMFile('/path/to/Pfam-A.hmm') as hmm_file:
    hmm = hmm_file.read()

Then it is just the single HMM profile (the first one in the Pfam database 1-cysPrx_C). So what am I doing wrong?

nextgenusfs commented 2 years ago

Okay, well I guess this is the way to do it.

import pyhmmer
# load sequences
alphabet = pyhmmer.easel.Alphabet.amino()
with pyhmmer.easel.SequenceFile('queries.fasta', digital=True, alphabet=alphabet) as seq_file:
    sequences = list(seq_file)
# then load Pfam and search
with pyhmmer.plan7.HMMFile('/path/to/Pfam-A.hmm') as hmm_file:
    hits = list(pyhmmer.hmmsearch(hmm_file, sequences))
althonos commented 2 years ago

Hi @nextgenusfs ! Happy that you figured it out. The issue was in here:

I naively assumed that the HMMFile would do the same?

with pyhmmer.plan7.HMMFile('/path/to/Pfam-A.hmm') as hmm_file:
   hmm = list(hmm_file.read())

If you wanted to read all the HMMs from the input file, you'd have to do:

with pyhmmer.plan7.HMMFile('/path/to/Pfam-A.hmm') as hmm_file:
   hmms = list(hmm_file)

But what you end up doing in your second comment is even better, because it will stream the HMMs as they are read from the file, instead of pre-loading them all in a list (which, for a HMM database like Pfam, can load a lot of things in memory),