Closed rtviii closed 1 year ago
in fact, i'm trying to scan all chains of a protein-rna complex in parallel (with 2 instances of hmmscan
) (each of which is populated respectively with Amino
- and RNA
-based hmms).
I'm noticing that if i do either on its own it seems to work no problem, but together i always get either pyhmmer.errors.AlphabetMismatch: Expected RNA alphabet, found amino alphabet
or pyhmmer.errors.AlphabetMismatch: Expected Amino alphabet, found rna alphabet
. Is there a threading trick under the hood where they all go to the same pool with the same (first instantiated) alphabet?
# ex. snippet
# *HMMClassifier is a just a wrapper around hmmscan with state
pipeline_polynucleotides = HMMClassifier( _rna_polynucleotides, [p for p in list(RNAClass)])
pipeline_polynucleotides.scan_chains()
rna_report = pipeline_polynucleotides.produce_classification()
print("rna report", rna_report)
pipeline_polypeptides = HMMClassifier( _prot_polypeptides, [p for p in [ *list(ProteinClass),*list(LifecycleFactorClass) ] ])
pipeline_polypeptides.scan_chains()
prots_report = pipeline_polypeptides.produce_classification()
print("prots report", prots_report)
Thanks a lot!
Hi @rtviii,
HMM and sequences have an alphabet that actually matters to compute the scores in the HMMER pipeline. You cannot mix both RNA and protein HMMs inside a single call to hmmscan
, because the Pipeline
object will enforce all HMMs and all sequences to have the same alphabet. The approach of having two different calls is the correct one because your object types are different.
Hi, thank Martin
Given that i took care to separate [rna sequences + rna-based hmms] and [protein sequences + aa-based hmms ] into two different hmmscan
objects, you are saying there shouldn't be any problem there with having the two of them in memory?
Yes exactly, each call to hmmscan
will initialize a new thread pool so there should not be a problem. Actually I guess the problem here is that hmmscan
does not autodetect the alphabet like it should (assuming protein alphabet instead of looking at the query/target objects first), i'll make a patch.
That's awesome to know. Thanks again for the quick reply.
Actually i started putting together an even more minimal example to convince me of the 2 hmmscan
-objects solution above and am running into something weird now. Can you spot what's going on here maybe? Perhaps my problem is much more basic because it worked well enough for proteins and i only started having this issue when i added rna to the mix:
target_rna_seq = "GGUUAAGCGACUAAGCGUACACGGUGGAUGCCCUGGCAGUCAGAGGCGAUGAAGGACGUGCUAAUCUGCGAUAAGCGUCGGUAAGGUGAUAUGAACCGUUAUAACCGGCGAUUUCCGAAUGGGGAAACCCAGUGUGUUUCGACACACUAUCAUUAACUGAAUCCAUAGGUUAAUGAGGCGAACCGGGGGAACUGAAACAUCUAAGUACCCCGAGGAAAAGAAAUCAACCGAGAUUCCCCCAGUAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCUGAAUCAGUGUGUGUGUUAGUGGAAGCGUCUGGAAAGGCGCGCGAUACAGGGUGACAGCCCCGUACACAAAAAUGCACAUGCUGUGAGCUCGAUGAGUAGGGCGGGACACGUGGUAUCCUGUCUGAAUAUGGGGGGACCAUCCUCCAAGGCUAAAUACUCCUGACUGACCGAUAGUGAACCAGUACCGUGAGGGAAAGGCGAAAAGAACCCCGGCGAGGGGAGUGAAAAAGAACCUGAAACCGUGUACGUACAAGCAGUGGGAGCACGCUUAGGCGUGUGACUGCGUACCUUUUGUAUAAUGGGUCAGCGACUUAUAUUCUGUAGCAAGGUUAACCGAAUAGGGGAGCCGAAGGGAAACCGAGUCUUAACUGGGCGUUAAGUUGCAGGGUAUAGACCCGAAACCCGGUGAUCUAGCCAUGGGCAGGUUGAAGGUUGGGUAACACUAACUGGAGGACCGAACCGACUAAUGUUGAAAAAUUAGCGGAUGACUUGUGGCUGGGGGUGAAAGGCCAAUCAAACCGGGAGAUAGCUGGUUCUCCCCGAAAGCUAUUUAGGUAGCGCCUCGUGAAUUCAUCUCCGGGGGUAGAGCACUGUUUCGGCAAGGGGGUCAUCCCGACUUACCAACCCGAUGCAAACUGCGAAUACCGGAGAAUGUUAUCACGGGAGACACACGGCGGGUGCUAACGUCCGUCGUGAAGAGGGAAACAACCCAGACCGCCAGCUAAGGUCCCAAAGUCAUGGUUAAGUGGGAAACGAUGUGGGAAGGCCCAGACAGCCAGGAUGUUGGCUUAGAAGCAGCCAUCAUUUAAAGAAAGCGUAAUAGCUCACUGGUCGAGUCGGCCUGCGCGGAAGAUGUAACGGGGCUAAACCAUGCACCGAAGCUGCGGCAGCGACGCUUAUGCGUUGUUGGGUAGGGGAGCGUUCUGUAAGCCUGCGAAGGUGUGCUGUGAGGCAUGCUGGAGGUAUCAGAAGUGCGAAUGCUGACAUAAGUAACGAUAAAGCGGGUGAAAAGCCCGCUCGCCGGAAGACCAAGGGUUCCUGUCCAACGUUAAUCGGGGCAGGGUGAGUCGACCCCUAAGGCGAGGCCGAAAGGCGUAGUCGAUGGGAAACAGGUUAAUAUUCCUGUACUUGGUGUUACUGCGAAGGGGGGACGGAGAAGGCUAUGUUGGCCGGGCGACGGUUGUCCCGGUUUAAGCGUGUAGGCUGGUUUUCCAGGCAAAUCCGGAAAAUCAAGGCUGAGGCGUGAUGACGAGGCACUACGGUGCUGAAGCAACAAAUGCCCUGCUUCCAGGAAAAGCCUCUAAGCAUCAGGUAACAUCAAAUCGUACCCCAAACCGACACAGGUGGUCAGGUAGAGAAUACCAAGGCGCUUGAGAGAACUCGGGUGAAGGAACUAGGCAAAAUGGUGCCGUAACUUCGGGAGAAGGCACGCUGAUAUGUAGGUGAGGUCCCUCGCGGAUGGAGCUGAAAUCAGUCGAAGAUACCAGCUGGCUGCAACUGUUUAUUAAAAACACAGCACUGUGCAAACACGAAAGUGGACGUAUACGGUGUGACGCCUGCCCGGUGCCGGAAGGUUAAUUGAUGGGGUUAGCGCAAGCGAAGCUCUUGAUCGAAGCCCCGGUAAACGGCGGCCGUAACUAUAACGGUCCUAAGGUAGCGAAAUUCCUUGUCGGGUAAGUUCCGACCUGCACGAAUGGCGUAAUGAUGGCCAGGCUGUCUCCACCCGAGACUCAGUGAAAUUGAACUCGCUGUGAAGAUGCAGUGUACCCGCGGCAAGACGGAAAGACCCCGUGAACCUUUACUAUAGCUUGACACUGAACAUUGAGCCUUGAUGUGUAGGAUAGGUGGGAGGCUUUGAAGUGUGGACGCCAGUCUGCAUGGAGCCGACCUUGAAAUACCACCCUUUAAUGUUUGAUGUUCUAACGUUGACCCGUAAUCCGGGUUGCGGACAGUGUCUGGUGGGUAGUUUGACUGGGGCGGUCUCCUCCUAAAGAGUAACGGAGGAGCACGAAGGUUGGCUAAUCCUGGUCGGACAUCAGGAGGUUAGUGCAAUGGCAUAAGCCAGCUUGACUGCGAGCGUGACGGCGCGAGCAGGUGCGAAAGCAGGUCAUAGUGAUCCGGUGGUUCUGAAUGGAAGGGCCAUCGCUCAACGGAUAAAAGGUACUCCGGGGAUAACAGGCUGAUACCGCCCAAGAGUUCAUAUCGACGGCGGUGUUUGGCACCUCGAUGUCGGCUCAUCACAUCCUGGGGCUGAAGUAGGUCCCAAGGGUAUGGCUGUUCGCCAUUUAAAGUGGUACGCGAGCUGGGUUUAGAACGUCGUGAGACAGUUCGGUCCCUAUCUGCCGUGGGCGCUGGAGAACUGAGGGGGGCUGCUCCUAGUACGAGAGGACCGGAGUGGACGCAUCACUGGUGUUCGGGUUGUCAUGCCAAUGGCACUGCCCGGUAGCUAAAUGCGGAAGAGAUAAGUGCUGAAAGCAUCUAAGCACGAAACUUGCCCCGAGAUGAGUUCUCCCUGACCCUUUAAGGGUCCUGAAGGAACGUUGAAGACGACGACGUUGAUAGGCCGGGUGUGUAAGCGCAGCGAUGCGUUGAGCUAACCGGUACUAAUGAACCGUGAGGCUUAACCU"
import os
import pyhmmer
from pyhmmer.easel import (
TextSequence,
)
rna_alphabet = pyhmmer.easel.Alphabet.rna()
def _get_rna_hmm():
seeds = [TextSequence(name=os.urandom(4), sequence=s) for s in RNA23S_ALIGNED]
builder = pyhmmer.plan7.Builder(alphabet=rna_alphabet)
background = pyhmmer.plan7.Background(rna_alphabet)
anonymous_msa = pyhmmer.easel.TextMSA(os.urandom(8),sequences=seeds)
rna23s_HMM, _profile, _optmized_profile = builder.build_msa(anonymous_msa.digitize(rna_alphabet), background)
return rna23s_HMM
digital_rnaseq = TextSequence(name=b"target_rnaseq", sequence=target_rna_seq).digitize(rna_alphabet)
rna_symbols = set("".join([* RNA23S_ALIGNED, target_rna_seq])) # collapse everything to 5 letters ACUG-
for S in rna_symbols:
assert(S in rna_alphabet.symbols) # make sure there is no stray AAs in the seq
list(pyhmmer.hmmscan(digital_rnaseq, [ _get_rna_hmm() ]))
I can't for the life of me figure out what i'm missing here.Error:
<HMM alphabet=Alphabet.rna() M=2892 name=b'\xe1\xac\x89h!h:\x07'>
Traceback (most recent call last):
File "/home/rtviii/dev/phmmer_test/test.py", line 98, in <module>
list(pyhmmer.hmmscan(digital_rnaseq, [ _get_rna_hmm()]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rtviii/dev/phmmer_test/venv/lib/python3.11/site-packages/pyhmmer/hmmer.py", line 1481, in hmmscan
profile.configure(item, _background)
File "pyhmmer/plan7.pyx", line 7272, in pyhmmer.plan7.Profile.configure
File "pyhmmer/plan7.pyx", line 7304, in pyhmmer.plan7.Profile.configure
pyhmmer.errors.AlphabetMismatch: Expected RNA alphabet, found amino alphabet
This gave more pause also because when I just search all models one by one via plan7.pipeline hmm_search
-- it works, all other things being equal:
# ? Same module, process, variables as above..
...
def seq_evaluate_v_hmm(seq:str,alphabet:pyhmmer.easel.Alphabet, hmm:pyhmmer.plan7.HMM):
seq_ = pyhmmer.easel.TextSequence(name=b"NA", sequence=seq)
dsb = DigitalSequenceBlock(alphabet, [seq_.digitize(alphabet)])
return pyhmmer.plan7.Pipeline(alphabet=alphabet, T=100).search_hmm(hmm,dsb)
rna_hmm = _get_rna_hmm()
print(list(seq_evaluate_v_hmm(target_rna_seq, rna_alphabet, rna_hmm))[0])
><pyhmmer.plan7.Hit object at 0x7f32edf10160>
Figured it out, was a silly class/instance attribute mistake so HMMs persisted between objects. Thanks nonetheless!
I suppose this is an extremely unlikely usecase in the first place and i can circumvent this in my work, decided to write up anyway:
I have some 3-20 nucleotide long trna/mrna fragments sprinkled in with longer rna and protein chains i'm scanning against. Is there a way to enforce the
pyhmmer.easel.Alphabet.rna()
for those? For example, it infersamino
forCCA
, which i'm not sure where to override, though i see theFIXME
in the wrapper.Minimum reproducible example:
The err i'm getting: