althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
129 stars 12 forks source link

`hmmscan` gets confused about short monotone sequences' alphabet. ( assumes `Amino`, actually `RNA`) #53

Closed rtviii closed 1 year ago

rtviii commented 1 year ago

I suppose this is an extremely unlikely usecase in the first place and i can circumvent this in my work, decided to write up anyway:

I have some 3-20 nucleotide long trna/mrna fragments sprinkled in with longer rna and protein chains i'm scanning against. Is there a way to enforce the pyhmmer.easel.Alphabet.rna() for those? For example, it infers amino for CCA, which i'm not sure where to override, though i see the FIXME in the wrapper.

Minimum reproducible example:

alphabet = pyhmmer.easel.Alphabet.rna()
sequence = "CCA"
query_seq  = pyhmmer.easel.TextSequence(name=bytes(seq_record.id,'utf-8'), sequence=sequence)
scans = list(pyhmmer.hmmscan([query_seq.digitize(alphabet)],[ <VALID 2000+ RNA HMMS> ] ))

The err i'm getting:


Traceback (most recent call last):
  File "/home/rtviii/dev/riboxyz/scripts/test.py", line 448, in <module>
    hmms.scan(alphabet, [ SeqRecord(cca['entity_poly_seq_one_letter_code_can']) ] )
  File "/home/rtviii/dev/riboxyz/ribctl/lib/classification.py", line 308, in scan
    scans = list(pyhmmer.hmmscan(query_seqs,[*self.class_hmms_registry.values()] ))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtviii/dev/ribxzvenv/lib/python3.11/site-packages/pyhmmer/hmmer.py", line 1481, in hmmscan
    profile.configure(item, _background)
  File "pyhmmer/plan7.pyx", line 7272, in pyhmmer.plan7.Profile.configure
  File "pyhmmer/plan7.pyx", line 7304, in pyhmmer.plan7.Profile.configure
pyhmmer.errors.AlphabetMismatch: Expected RNA alphabet, found amino alphabet
``
rtviii commented 1 year ago

in fact, i'm trying to scan all chains of a protein-rna complex in parallel (with 2 instances of hmmscan) (each of which is populated respectively with Amino- and RNA-based hmms).

I'm noticing that if i do either on its own it seems to work no problem, but together i always get either pyhmmer.errors.AlphabetMismatch: Expected RNA alphabet, found amino alphabet or pyhmmer.errors.AlphabetMismatch: Expected Amino alphabet, found rna alphabet. Is there a threading trick under the hood where they all go to the same pool with the same (first instantiated) alphabet?

# ex. snippet
# *HMMClassifier is a just a wrapper around hmmscan with state
pipeline_polynucleotides = HMMClassifier( _rna_polynucleotides, [p for p in list(RNAClass)])
pipeline_polynucleotides.scan_chains()
rna_report              = pipeline_polynucleotides.produce_classification()
print("rna report", rna_report)

pipeline_polypeptides    = HMMClassifier( _prot_polypeptides,  [p for p in [ *list(ProteinClass),*list(LifecycleFactorClass) ] ])
pipeline_polypeptides.scan_chains()
prots_report =            pipeline_polypeptides.produce_classification()
print("prots report", prots_report)

Thanks a lot!

althonos commented 1 year ago

Hi @rtviii,

HMM and sequences have an alphabet that actually matters to compute the scores in the HMMER pipeline. You cannot mix both RNA and protein HMMs inside a single call to hmmscan, because the Pipeline object will enforce all HMMs and all sequences to have the same alphabet. The approach of having two different calls is the correct one because your object types are different.

rtviii commented 1 year ago

Hi, thank Martin

Given that i took care to separate [rna sequences + rna-based hmms] and [protein sequences + aa-based hmms ] into two different hmmscan objects, you are saying there shouldn't be any problem there with having the two of them in memory?

althonos commented 1 year ago

Yes exactly, each call to hmmscan will initialize a new thread pool so there should not be a problem. Actually I guess the problem here is that hmmscan does not autodetect the alphabet like it should (assuming protein alphabet instead of looking at the query/target objects first), i'll make a patch.

rtviii commented 1 year ago

That's awesome to know. Thanks again for the quick reply.

Actually i started putting together an even more minimal example to convince me of the 2 hmmscan-objects solution above and am running into something weird now. Can you spot what's going on here maybe? Perhaps my problem is much more basic because it worked well enough for proteins and i only started having this issue when i added rna to the mix:

`RNA23S_ALIGNED` ```python3 RNA23S_ALIGNED = ['-----------AUACCACACACAAGGGAUCGUUUGGCUCCUGGGACGAGGAAGGGCGCAGCAGAGAGCGAUAUGU--GUGGGAGCAGCA-----GCAAUGA-CCACACGUCCCCGAAU---------------------------------ACAGCAUAGUGCUG-GAGUACUCCUUGGAAUU-AAGCAUAUGAGUAAAGGAAGGAAAAGAAACUAACGAG-GAUUCCCGUAGUAGCGGCGAGUGAACAGGGAACAGUCC-AACAGUGUAAUUCUCAAAUGAGAAGUUGUCAAAAGG--------AAGAAGAUGUGAAUCAGAAAGAA----------AAGCUGAACGAUACAGGGUGAUAGUCCCGUAGU------GUCUUCUUACAUACGGU-------GAGUA--GCUGUGCUCGGUAAUGCACAGUGAAGAG-GUGGCAGUGUGCAUCUAAGACUAAAUACAACAGGAGACCGAUAGCAAAGAAGUAGGUUGACCG--------AAAACAUCU---------AAGUGAAAUUGUGUGUGGACCCGAAUG--GAUCGGGCC---------------------------------------------------------------------------------CGUCUUGAAACACGGACCAAGGAGUGCACAUACACUGCAAGUUGAGCAGAGGAAGUCUGUGAGGCGUAGUGAAGACGGAA---------------------------------------------AGCAGUGUGAGUGCGACCCGAUAGACUUGUGAACUAUACCUUGUUGCUGUGAAGGCUGGCGAAAGCCAGCUGGAGGCGGCAAGCCGU-AUUGAUCUGCAAAUCAUUGGCGUAAGCAGGGUAUAGGGGCGAAAGACCAAUCGAACAGUCUAGUUGCUGGUUCCCUCCGAAAUGUCUAUCAGGACAG-CGCGCACGCAGAGAGAAGCAAGGUAGAGC-AUGGGCCGGUAU--------CCUUAAGGAGACGGACGAACUGCGAAUGUGCUUUUCUUCUAUGCAAGCGUAGUGGGCGAGUUCUGGUAAGCAGGACUG-----------GCGAAGAGGAAUGAACCUGGCACUGUGCUAAGGAACCCAAUGUCUGGACAGAG--------ACCGAAAGAGGUAGGUGAAUAAGGACGGUAGGGCGGUAGCCAUGGAAGUCGGCAUCCGGUAAGAAACGUGUUACAACGUACCUACGGAAUUCAUCUGCUCUGAAAAUUGA-UGGCGCUUA----------------CCAGACAUCCGAUGCACA---GUUAGCAGGUAGGAGGGCG-UGUGUCUACUGGCGAAGGUA-CCGAGUGAUCGGUGGUGGAGGUAGAUGCAGAGCAGAUCUUGGUGGUAGUAGCGAUAAUUUGUCUUCAGCGGCAAAGACUG--AGGAGGAGAAGGGUUUCUUC-------UCAUAAG--AAGAGUGAGCCGGG-UCUAAGCGAGUGUGUAG---AAGCAUGAGCGAAGGAGAAACAGGUUGAUAUUCCUGUGCCUGUGCUGUCGUACGGCAACGUGUAUUAGCUUGUCGACACGAACAUCUAGACAAAGCAGAAACACCUUUUCUGUUCAAGGGGAAAAAGACCACUGAGACGGAUCAUCUGGCGACGUGGUUUUAUAUCUGUGACCAGGCCUGAGCCUGGUUUGUUGAGGAUGGUCGUCCAUGAAAAGAC---AACUAUAUGGCAGCACAGUCCGUACC---AACCGUAUCAGGACUCCAAGGUGAAGAGCCUCUAG---UG-GAUGUUUUACACAUGUAAGGGAAUUCGGCAAAAUGGAUGCGUAACUUCGGGAGAAGUAUUGGUU----CAAUUGUGAAC---------UCAUUGUGACAAGGGGAAUCUGACUGUUUAGUAAAAACAUAGCU----------UCAUGAAGAGAGAU----GAAGUGAGUUCUGCCCGGUGCUCG---GUCGUGACACGGAGGAAAUGCCAGA-----AAGCACGGGUUAACGGCGGGAGUAACUAUGACUCUCUUAAGGUAGCCAAACGCCUCGUCAUUUAAAUGGUGACGCGCAUGAAUGGAAUGACGAGAUUCCCACUGUCCCUACAUGUAGACUAGCGAACCC--AUUUCCAGGGGAACGGGCCUGGACGGCCAGCGGGGAAAGAAGACCCUGUUGAGCUUGACU-----CUAGUGUGGCCAAGUUGUGACGAUUG---AGGCGAUGUAGGAA----------------------GGUGGAGAGCGCAAGCAGAAGUGAAAGACCACUGCGCCAUGAUCAUCA-CGACACACUUGUUUAAGGACAAAG-GCCAGAUGGGGAGUUUGGCUGGGGCGGUACGACCACAGAAUCAGAACGUGGUCGACCGAAGGUACUCACAGCGAGACGAGAACACUCGUGUAGAGCAUAAGGACUAAAGAGUGCCUGAGUGUGAGCACUACUGCUUGCA----UGAGGGGAAACCCGGGCCUAGCGAUCCUACGCAUACGAGACACUUGUGGCGUGGGUGUCAG--AAAAGUUACCACAGGGAUAACUGGCUUGUGGCCGCCGAGCGUAGAAAGCGACGCGGCUUUUUGAUUCUUCGAUGUCGGCUCUUCCUAGCAUGGCCAUGCAGCCGUGGCGAAGUGUUGGAUUGUUCACCCACUAACAGGGAACGUGAGCUGGGUUUAGACCGUCGUGAGACAGGUUAGUUUUAUCCUACUGUCCAAACAG------UUGAAGGGAGUG-UGGAUUAGUACGAAAGGAAACCCACACG-UGACCUCUGGUGCAGCGGGUGUGCGG--AAGUGCAACUGCCAUGCUACGUCUC-CUCAAG----CAUAGCUGGAAGCCUCUAAGCUAGAAAUGAGUCCUGACUCUGGAAGACGGCCGGGAAGACGACCCGUAGUUAACAGCGGUGUUGUACGAUAUGAGCUUUCUAUUUGU-GUAGUAGCGAGUAUUUGAGAC-ACUGGUUGUUACC-------ACCGUCGUUUAUU', 'GGUCAAGAUGGUAAGGGCCCACGGUGGAUGCCUCGGCACCCGAGCCGAUGAAGGACGUGGCUACCUGCGAUAAGCCAGGGGGAGCCGGUAGCGGGCGUGGAUCCCUGGAUGUCCGAAUGGGGGAACCCGGCCGGCGGGAACGCCGGUCACCGCGCUUUUGCGCGG-GGGGAACCUGGGGAACUGAAACAUCUCAGUACCCAGAGGAGAGGAAAGAGAAAUC-GACUCCCUGAGUAGCGGCGAGCGAAAGGGGACCAGCCUAAACCGUCCGGCUUGUCCGGGCGGGGUCGUG---GGGCCCUCGGACACCGAAUCCCCAGCCUAGCCGAAGCUGUUGGGAAGCAGCGCCAGAGAGGGUGAAAGCCCCGUAGGCGAAAGGUGGGGGGAUAGGUGAGGGUACCCGAGUACCCCGUGGUUCGUGGAGCCAUGGGGGAAUCUGGGCGGACCACCGCCUAAGGCUAAGUACUCCGGGUGACCGAUAGCGCACCAGUACCGUGAGGGAAAGGUGAAAAGAACCC-CGGGAGGGGAGUGAAAUAGAGCCUGAAACCG---------UGGGCUUACAAGCAGUCACGGCCCCGCAA--------------------------------------------GGGGUUGUGGCGUGCCUAUUGAAGCAUGAGCCGGCGACUCACGGUCGUGGGCGAGCUUAAGCCGUUGAGGC--GGAGGCGUAGGGAAACCGAGU------CCGAACAGGGCGCAAAGUCCGC-----------------GGCCGUG-------GACCCGA-AACCGGGCGAGCUAGCCCUGGCCAGGGUGAAGCUGGGGUGAGACCCAGUGGAGGCCCGAACCGGUGGGGGA--UGCAAACCCCUCGGAUGAGCUGGGGCUAGGAGUGAAAAGCUAACCGAGCCCGGAGAUAGCUGGUUCUCCCCGAAAUGACUUUAGGGUCAGCCUCAGGCGCUGACUGGGGC-CUGUAGAGC-ACUGAUAGGGCUAGGGGGC-CCACCAGCCUACCAAAC-CCUGUCAAACUCCGAAGGGUC---CCAGGUGGAGCCUGGGAGUGAGGGCGCGAGCGAUAACGUCCGCGUCCGAGCGCGGGAACAACCGAGACCGCCAGCUAAGGCCCCCAA-GUCUGGGCUAAGUGG-------UAAAGGAUGUGGCGCCGCGAAGACAGCCAGGAGGUUGGCUUAGAAGCAGCCAUCCUUUAAAGAGUGCGUAAUAGCUCACUGGUCGAGUGGCGCCGCGCCGAAAAUGAUGCGGGGCUUAAGCCCAGCGCCGAAG-CUGCGGGUCUGGGGGAUGACCCCAGGC-GGUAGGGGAGCGUUCCCGAUGCCGAUGAAGGCCGACCCGCGAG-GCGGCUGGAGGUAAGGGAAGUGCGAAUGCCGGCAUGAGUAACGAUAA--AGAGGGUGAGAAUCCCUCUCGCCGUAAGCCCAAGGGUUCCUACGCAAUGGUCGUCAGCGUAGGGUUAGGCGGGACCUAAGGUGAAGCCGAA---AGGCGUAGCCGAA-GGGCAGCCGGUUAAUAUUCCGGCCCUUCCCGCA----GGUGCGAUGGGGGGACGCUCUAGGCUAGGGGGACCGGAGCCAUGGACGAGC-CCGGCCAGAAGCGCAGGGUGGGAGGUAGGCAAAUCCGCCUCCCAACAAGCUCUGCGUGGUGGGGAAGCCCGUACGGGUG-ACAACCCCCCGAAGCCAGGGAGCCAAGAAAAGCCUCUAAGCACAACCUGCGGGAACCCGUACCGCAAACCGACACAGGUGGGCGGGUGCAAGAGCACUCAGGCGCG-CGGGAG-AACCCUCGCCAAGGAACUCUGCAAGUUGGCCCCGUAACUUCGGGAGAAGGGGUGCUCCCUGGGGUGAUGAGCCCCGGGGAGCCGCAGUGAACAGGCUCUGGCGACUGUUUACCAAAAACACAGCUCUCUGCGAACUCGUAAGAGGAGGUAUAGGGAGCGACGCUUGCCCGGUGCCGGAAGGUCAAGGGGAGGGGUGCAAGCCCCGAACCGAAGCCCCGGUGAACGGCGGCCGUAACUAUAACGGUCCUAAGGUAGCGAAAUUCCUUGUCGGGUAAGUUCCGACCUGCACGAAAAGCGUAACGACCGGAGCGCUGUCUCGGCGAGGGACCCGGUGAAAUUGAACUGGCCGUGAAGAUGCGGCCUACCCGUGGCAGGACGAAAAGACCCCGUGGAGCUUUACUGCAGCCUGGUGUUGGCUCUUGGUCGCGCCUGCGUAGGAUAGGUGGGAGCCUGUGAACCCCCGCCUCCG--GGUGGGGGG--GAGGCGCCGGUGAAAUACCACCCUGGCGCGGCUGGGGGCCUAACCCUCGGAUGGGGGGACAGCGCUUGGCGGGCAGUUUGACUGGGGCGGU-CGCCUCCUAAAAGGUAACGGAGGCGCCCAAAGGUCCCCUCAGGCGGGACGGAAAUCCGCCGGAGAGCGCAAGGGUAGAAGGGGGCCUGACUGCGAG---GCCUGCAAGCCGAGCAGGGGCGAAAGCCGGGCCUAGUGAACCGGUGGUCCCGUGUGGAAGGGCCAUCGAUCAACGGAUAAAAGUUACCCCGGGGAUAACAGGCUGAUCUCCCCCGAGCGUCCACAGCGGCGGGGAGGUUUGGCACCUCGAUGUCGGCUCGUCGCAUCCUGGGGCUGAAGAAGGUCCCAAGGGUUGGGCUGUUCGCCCAUUAAAGCGGCACGCGAGCUGGGUUCAGAACGUCGUGAGACAGUUCGGUCUCUAUCCGCCACGGGCGCAG-GAGGCUUGAGGGGGGCU-CUUCCUAGUACGAGAGGACCGGAAGGGACGCACCUCUGGUUUCCCAGCUGUCCCUCCAGGGGCAUAAGCUGGGUAGCCAUGUGCGGAAGGGAUAACCGCUGAAAGCAUCUAAGCGGGAAGCCCGCCCCAAGAUGAGGCCUCCCACGGCGUCAAGCCGGUAAGGACCCGGGAAGACCACCCGGUGGAUGGGCCGGGGGUGUAAGCGCCGCGAGGCGUUGAGCCGACCGGUCCCAAUCGUCCGAGGUCUUGACCCCUC', '----UUGGCUACUAUGCCAGCUGGUGGAUUGCUCGGCUCAGGCGCUGAUGAAGGACGUGCCAAGCUGCGAUAAGCCAUGGGGAGCCGCACGGAGGCGAAGAACCAUGGAUUUCCGAAU-----------------GAGAAUCUCUCUAACAAUUGCUUCGCGCAAUGAGGAACCCCGAGAACUGAAACAUCUCAGUAUCGGGAGGAACAGAAAACGCAAUGUGAUGUCGUUAGUAACCGCGAGUGAACGCGAUACAGCCCAAACCG--AAGCCCUCACGGGCAAUGUGGUGUCAGGGCUACCUCUCAUCAGCCGACCGUCUCGACGAAGUCUCUUGGAACAGAGCGUGAUACAGGGUGACAACCCCGUACUCGAGA-CCAGUACGACGUGCGGUAGUGCCAGAGUA--GCGGGGGUUGGAUAUCCCUCGCGAAUAACGCAGGCAUCGACUGCGAAGGCUAAACACAACCUGAGACCGAUAGUGAACAAGUAGUGUGAACGAACGCUGCAAAGUACCCUCAGAAGGGAGGCGAAAUAGAGCAUGAAAUCAGUUGGCGAUCGAGCGACAGGGCAUACAAGGUCCCUCGACGAAUGACCGACGCGCGAGCGUCCAGUAAGACUCACGGGAAGCCGAUGUUCUGUCGUACGUUUUGAAAAACGAGCCAGGGAGUGUGUCUGCAUGGCAAGUCUAACCGGAGUAUCCGGGGAGGCACAGGGAAACCGACAUGGCCGCAGGGCUUUGCCCGAGGGCCGCCGUCUUCAAGGGCGGGGAGCCAUGUGGACACGACCCGA-AUCCGGACGAUCUACGCAUGGACAAGAUGAAGCGUGCCGAAAGGCACGUGGAAGUCUGUUAGAGUUGGUGUCCUACAAUACCCUCUCGUGAUCUAUGUGUAGGGGUGAAAGGCCCAUCGAGUCCGGCAACAGCUGGUUCCAAUCGAAACAUGUCGAAGCAUGACCUCCGCCGAGGUAGUCUGUGAGGUAGAGCGACCGAUUGGUGUGUCCGCCUCCGAGAGGAGUCGGCACACCUGUCAAACUCCAAACUUACAGACGCCGUUUGACGCGGGGAUUCCGGUGCGCGGGGUAA-GCCUGUGUACCAGGAGGGGAACAACCCAGAGA-UAGGUUAAGGUCCCCAA-GUGUGGAUUAAGUGUAAUCCUCUGAAGGUGGUCUCGAGCCCUAGACAGCCGGGAGGUGAGCUUAGAAGCAGCUACCCUCUAAGAAAAGCGUAACAGCUUACCGGCCGAGGUUUGAGGCGCCCAAAAUGAU-CGGGACUCAAAUCCACCACCGAGACCUGUCCGUACCACUCAUACUGGUAAUCGAGUAGAUUGGCGCUCUAAUUG--GAUGGAAGUAGGGGUGAAAACUCCUAUGGACCGAUUAGUGACGAAAAUCCUGGCCAUAGUAGCAGCGAUAGUCGGGUGAGAACCCCGACGGCCUAAUGGAUAAGGGUUCCUCAGCACUGCUGAUCAGCUGAGGGUUAGCCGGU-CCUAAGUCAUACCGCAACUCGACUAUGACGAAAUGGGAAACGGGUUAAUAUUCCCGUGCCACU--------------AUGCAGUGAAAGUUGACGCCCUGGGGUCGAUCACGCUGGGCAUUCGCCCAGUCGAACC----GUCCAACUCCGUGGAAGCCGUAAUGGCAGGAAGCG------GACGAACGGCGGCAUAGGGAAACGUGAUUC-----AACCUGGGGCCCAUGAAAAGAC---GAGCAUA--------GUGUCCGUACCGAGAACCGACACAGGUGUCCAUGGCGGCGAAAGCCAAGGCCUGUCGGGAGCAACCAACGUUAGGGAAUUCGGCAAGUUAGUCCCGUACCUUCGGAAGAAGGGAUGCCUGCUCCGGAACGGAGC------AGGUCGCAGUGACUCGGAAGCUCGGACUGUCUAGUAACAACAUAGGUGACCGCAAAUCCGCAAGGACUCGUACGGUCACUGAAUCCUGCCCAGUGCAGG--UAUCUGAACACCUCGUACAAG----AGGACGAAGGACCUGUCAACGGCGGGGGUAACUAUGACCCUCUUAAGGUAGCGUAGUACCUUGCCGCAUCAGUAGCGGCUUGCAUGAAUGGAUUAACCAGAGCUUCACUGUCCCAACGUUGGGCCCGGUGAACUGUACAUUCCAGUGCGGAGUCUGGAGACACCCAGGGGGAAGCGAAGACCCUAUGGAGCUUUACUGCAGGCUGUCGCUGAGACGUGGUCGCCGAUGUGCAGCAUAGGUAGGAGACACUACACAGGUACCCGCGCUAGCGGGCCACCGAGUCAACAGUGAAAUACUACCCGUCGGUGACUGCGACUCUCACUCCGGGAGGAGGACAC--CGAUAGCCGGGCAGUUUGACUGGGGCGGUACGCGCUCGAAAAGAUAUCGAGCGCGCCCUAUGGCUAUCUCAGCCGGGACAGAGACCCGGCGAAGAGUGCAAGAGCAAAAGAUAGCUUGACAGUGUUCUUCCCAACGAGGAACGCUGACGCGAAAGCGUGGUCUAGCGAACCAAUUAGCCUGCUUGAUGCGGGCAAUUGAUGACAG--AAAAGCUACCCUAGGGAUAACAGAGUCGUCACUCGCAAGAGCACAUAUCGACCGAGUGGCUUGCUACCUCGAUGUCGGUUCCCUCCAUCCUGCCCGUGCAGAAGCGGGCAAGGGUGAGGUUGUUCGCCUAUUAAAGGAGGUCGUGAGCUGGGUUUAGACCGUCGUGAGACAGGUCGGCUGCUAUCUACUGGGUGUGUAAUGGUGUCUGACAAGAACGACCGUAUAGUACGAGAGGAACUACGGUUGGUGGCCACUGGUGUACCGGUUGUUCGA--GAGAGCACGUGCCGGGUAGCCACGC-CACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGAC-ACCGCCGAGGUCCCG-CGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGC-------GUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU-----'] ```

target_rna_seq     = "GGUUAAGCGACUAAGCGUACACGGUGGAUGCCCUGGCAGUCAGAGGCGAUGAAGGACGUGCUAAUCUGCGAUAAGCGUCGGUAAGGUGAUAUGAACCGUUAUAACCGGCGAUUUCCGAAUGGGGAAACCCAGUGUGUUUCGACACACUAUCAUUAACUGAAUCCAUAGGUUAAUGAGGCGAACCGGGGGAACUGAAACAUCUAAGUACCCCGAGGAAAAGAAAUCAACCGAGAUUCCCCCAGUAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCUGAAUCAGUGUGUGUGUUAGUGGAAGCGUCUGGAAAGGCGCGCGAUACAGGGUGACAGCCCCGUACACAAAAAUGCACAUGCUGUGAGCUCGAUGAGUAGGGCGGGACACGUGGUAUCCUGUCUGAAUAUGGGGGGACCAUCCUCCAAGGCUAAAUACUCCUGACUGACCGAUAGUGAACCAGUACCGUGAGGGAAAGGCGAAAAGAACCCCGGCGAGGGGAGUGAAAAAGAACCUGAAACCGUGUACGUACAAGCAGUGGGAGCACGCUUAGGCGUGUGACUGCGUACCUUUUGUAUAAUGGGUCAGCGACUUAUAUUCUGUAGCAAGGUUAACCGAAUAGGGGAGCCGAAGGGAAACCGAGUCUUAACUGGGCGUUAAGUUGCAGGGUAUAGACCCGAAACCCGGUGAUCUAGCCAUGGGCAGGUUGAAGGUUGGGUAACACUAACUGGAGGACCGAACCGACUAAUGUUGAAAAAUUAGCGGAUGACUUGUGGCUGGGGGUGAAAGGCCAAUCAAACCGGGAGAUAGCUGGUUCUCCCCGAAAGCUAUUUAGGUAGCGCCUCGUGAAUUCAUCUCCGGGGGUAGAGCACUGUUUCGGCAAGGGGGUCAUCCCGACUUACCAACCCGAUGCAAACUGCGAAUACCGGAGAAUGUUAUCACGGGAGACACACGGCGGGUGCUAACGUCCGUCGUGAAGAGGGAAACAACCCAGACCGCCAGCUAAGGUCCCAAAGUCAUGGUUAAGUGGGAAACGAUGUGGGAAGGCCCAGACAGCCAGGAUGUUGGCUUAGAAGCAGCCAUCAUUUAAAGAAAGCGUAAUAGCUCACUGGUCGAGUCGGCCUGCGCGGAAGAUGUAACGGGGCUAAACCAUGCACCGAAGCUGCGGCAGCGACGCUUAUGCGUUGUUGGGUAGGGGAGCGUUCUGUAAGCCUGCGAAGGUGUGCUGUGAGGCAUGCUGGAGGUAUCAGAAGUGCGAAUGCUGACAUAAGUAACGAUAAAGCGGGUGAAAAGCCCGCUCGCCGGAAGACCAAGGGUUCCUGUCCAACGUUAAUCGGGGCAGGGUGAGUCGACCCCUAAGGCGAGGCCGAAAGGCGUAGUCGAUGGGAAACAGGUUAAUAUUCCUGUACUUGGUGUUACUGCGAAGGGGGGACGGAGAAGGCUAUGUUGGCCGGGCGACGGUUGUCCCGGUUUAAGCGUGUAGGCUGGUUUUCCAGGCAAAUCCGGAAAAUCAAGGCUGAGGCGUGAUGACGAGGCACUACGGUGCUGAAGCAACAAAUGCCCUGCUUCCAGGAAAAGCCUCUAAGCAUCAGGUAACAUCAAAUCGUACCCCAAACCGACACAGGUGGUCAGGUAGAGAAUACCAAGGCGCUUGAGAGAACUCGGGUGAAGGAACUAGGCAAAAUGGUGCCGUAACUUCGGGAGAAGGCACGCUGAUAUGUAGGUGAGGUCCCUCGCGGAUGGAGCUGAAAUCAGUCGAAGAUACCAGCUGGCUGCAACUGUUUAUUAAAAACACAGCACUGUGCAAACACGAAAGUGGACGUAUACGGUGUGACGCCUGCCCGGUGCCGGAAGGUUAAUUGAUGGGGUUAGCGCAAGCGAAGCUCUUGAUCGAAGCCCCGGUAAACGGCGGCCGUAACUAUAACGGUCCUAAGGUAGCGAAAUUCCUUGUCGGGUAAGUUCCGACCUGCACGAAUGGCGUAAUGAUGGCCAGGCUGUCUCCACCCGAGACUCAGUGAAAUUGAACUCGCUGUGAAGAUGCAGUGUACCCGCGGCAAGACGGAAAGACCCCGUGAACCUUUACUAUAGCUUGACACUGAACAUUGAGCCUUGAUGUGUAGGAUAGGUGGGAGGCUUUGAAGUGUGGACGCCAGUCUGCAUGGAGCCGACCUUGAAAUACCACCCUUUAAUGUUUGAUGUUCUAACGUUGACCCGUAAUCCGGGUUGCGGACAGUGUCUGGUGGGUAGUUUGACUGGGGCGGUCUCCUCCUAAAGAGUAACGGAGGAGCACGAAGGUUGGCUAAUCCUGGUCGGACAUCAGGAGGUUAGUGCAAUGGCAUAAGCCAGCUUGACUGCGAGCGUGACGGCGCGAGCAGGUGCGAAAGCAGGUCAUAGUGAUCCGGUGGUUCUGAAUGGAAGGGCCAUCGCUCAACGGAUAAAAGGUACUCCGGGGAUAACAGGCUGAUACCGCCCAAGAGUUCAUAUCGACGGCGGUGUUUGGCACCUCGAUGUCGGCUCAUCACAUCCUGGGGCUGAAGUAGGUCCCAAGGGUAUGGCUGUUCGCCAUUUAAAGUGGUACGCGAGCUGGGUUUAGAACGUCGUGAGACAGUUCGGUCCCUAUCUGCCGUGGGCGCUGGAGAACUGAGGGGGGCUGCUCCUAGUACGAGAGGACCGGAGUGGACGCAUCACUGGUGUUCGGGUUGUCAUGCCAAUGGCACUGCCCGGUAGCUAAAUGCGGAAGAGAUAAGUGCUGAAAGCAUCUAAGCACGAAACUUGCCCCGAGAUGAGUUCUCCCUGACCCUUUAAGGGUCCUGAAGGAACGUUGAAGACGACGACGUUGAUAGGCCGGGUGUGUAAGCGCAGCGAUGCGUUGAGCUAACCGGUACUAAUGAACCGUGAGGCUUAACCU"

import os
import pyhmmer
from pyhmmer.easel import (
    TextSequence,
)
rna_alphabet   = pyhmmer.easel.Alphabet.rna()

def _get_rna_hmm():
    seeds         = [TextSequence(name=os.urandom(4), sequence=s) for s in RNA23S_ALIGNED]
    builder       = pyhmmer.plan7.Builder(alphabet=rna_alphabet)
    background    = pyhmmer.plan7.Background(rna_alphabet)
    anonymous_msa = pyhmmer.easel.TextMSA(os.urandom(8),sequences=seeds)
    rna23s_HMM, _profile, _optmized_profile = builder.build_msa(anonymous_msa.digitize(rna_alphabet), background)
    return rna23s_HMM

digital_rnaseq = TextSequence(name=b"target_rnaseq", sequence=target_rna_seq).digitize(rna_alphabet)
rna_symbols = set("".join([* RNA23S_ALIGNED, target_rna_seq])) # collapse everything to 5 letters ACUG-

for S in rna_symbols:
    assert(S in rna_alphabet.symbols) # make sure there is no stray AAs in the seq

list(pyhmmer.hmmscan(digital_rnaseq, [ _get_rna_hmm() ]))

I can't for the life of me figure out what i'm missing here.Error:

<HMM alphabet=Alphabet.rna() M=2892 name=b'\xe1\xac\x89h!h:\x07'>
Traceback (most recent call last):
  File "/home/rtviii/dev/phmmer_test/test.py", line 98, in <module>
    list(pyhmmer.hmmscan(digital_rnaseq, [ _get_rna_hmm()]))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtviii/dev/phmmer_test/venv/lib/python3.11/site-packages/pyhmmer/hmmer.py", line 1481, in hmmscan
    profile.configure(item, _background)
  File "pyhmmer/plan7.pyx", line 7272, in pyhmmer.plan7.Profile.configure
  File "pyhmmer/plan7.pyx", line 7304, in pyhmmer.plan7.Profile.configure
pyhmmer.errors.AlphabetMismatch: Expected RNA alphabet, found amino alphabet
rtviii commented 1 year ago

This gave more pause also because when I just search all models one by one via plan7.pipeline hmm_search -- it works, all other things being equal:

# ? Same module, process, variables as above..
...
def seq_evaluate_v_hmm(seq:str,alphabet:pyhmmer.easel.Alphabet, hmm:pyhmmer.plan7.HMM):
    seq_  = pyhmmer.easel.TextSequence(name=b"NA", sequence=seq)
    dsb   = DigitalSequenceBlock(alphabet, [seq_.digitize(alphabet)])
    return pyhmmer.plan7.Pipeline(alphabet=alphabet, T=100).search_hmm(hmm,dsb)

rna_hmm = _get_rna_hmm()
print(list(seq_evaluate_v_hmm(target_rna_seq, rna_alphabet, rna_hmm))[0])
><pyhmmer.plan7.Hit object at 0x7f32edf10160>
rtviii commented 1 year ago

Figured it out, was a silly class/instance attribute mistake so HMMs persisted between objects. Thanks nonetheless!