althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
125 stars 12 forks source link

nhmmer reports Alphabet mismatch #12

Closed fbosnic closed 3 years ago

fbosnic commented 3 years ago

I'm trying to run nhmmer function but can't seem to get the alphabet right. Running

    import pyhmmer    
    seq1 = pyhmmer.easel.TextSequence(name=b"seq1", sequence="ACCGACA")
    seq2 = pyhmmer.easel.TextSequence(name=b"seq2", sequence="GGGCCAACA")
    rna = pyhmmer.easel.Alphabet.rna()
    dig1, dig2 = [s.digitize(rna) for s in [seq1, seq2]]
    builder = pyhmmer.plan7.Builder(rna, prior_scheme="alphabet")

    gen = pyhmmer.hmmer.nhmmer([dig1], [dig2], builder=builder)
    next(gen)

results in

    Traceback (most recent call last):
      File "pyhmmer_test.py", line 10, in <module>
        next(gen)
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 310, in _multi_threaded
        raise thread.error from None
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 112, in run
        self.process(index, query)
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 125, in process
        hits = self.search(query)
      File "/apps/conda/fbosnic/envs/test/lib/python3.8/site-packages/pyhmmer/hmmer.py", line 166, in search
        return self.pipeline.search_seq(query, self.sequences, self.builder)
      File "pyhmmer/plan7.pyx", line 3777, in pyhmmer.plan7.Pipeline.search_seq
      File "pyhmmer/plan7.pyx", line 3819, in pyhmmer.plan7.Pipeline.search_seq
    pyhmmer.errors.AlphabetMismatch: Expected Alphabet.amino(), found Alphabet.rna()

Am I using it correctly, does the alphabet need to be set somewhere else as well?

As far as I could trace it, the following line might be the cause https://github.com/althonos/pyhmmer/blob/eb6fe7c0e74557e0ae9d647693711583d2d86b68/pyhmmer/hmmer.py#L82

althonos commented 3 years ago

Hi @fbosnic ,

The issue you're having is indeed coming from that particular line, and since I was not testing nhmmer until now I didn't realize there was an bug there. However, after writing more extensive test cases, I also noticed that I couldn't get pyhmmer.hmmer.nhmmer to get the same hits as the nhmmer binary. I implemented the pyhmmer.hmmer.nhmmer function thinking nhmmer was just doing the same thing as phmmer with the DNA or RNA alphabet instead of the Amino alphabet, but since it may not be the case, I'll try to look into this issue a bit further and look if I need to rewrite the current implementation.

fbosnic commented 3 years ago

Hi @althonos,

I see, thanks for the answer. I would be very interested in using the function once this is sorted out.

althonos commented 3 years ago

Hi @fbosnic ,

The difference between nhmmer and phmmer is that nhmmer will use the "long target pipeline" instead of the default pipeline used by everything else. This allows pipelined searches to run on target sequences longer than 100,000 residues using a sliding window to process each target.

I've released v0.4.9, which has a fixed implementation for nhmmer. The function now accepts HMM arguments in addition to DigitalSequence and DigitalMSA arguments, and will run a long target search pipeline. I have added tests to make sure the results are consistent between HMMER and PyHMMER on the same query HMM and target genome.