hmmsearch doesn't take long translated sequences as input

EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs


That has actually never worked, so your previous results using this strategy may not be correct. HMMER uses a probability model of a complete protein sequence, so it expects each input sequence to be an individual protein. (One reason why: HMMER is trying to tell you which proteins are homologous, and if you only give it six superlong "proteins" for a complete genome or chromosome, output telling you which of the six frames contains homologs isn't very useful.) The longest known proteins are about 30K-40K residues long. Additionally, because of some numerical stability issues with small transition probabilities in profile HMMs, our dynamic programming algorithms are only guaranteed to work on sequences up to a maximum length of 100K residues.

Recently we added an explicit check that rejects inputs like the ones you are giving, and that's the error you're getting.
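To find out ahead of time which sequences in a FASTA file would hit that limit, a quick length scan is enough. The following is only a rough sketch, not HMMER's own check: the 100,000 threshold is taken from the figure quoted above, and `read_fasta` and `too_long` are hypothetical helper names invented for this illustration.

```python
MAX_LEN = 100_000  # assumed threshold, taken from the limit quoted in this thread


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def too_long(lines, max_len=MAX_LEN):
    """Return (name, length) for every sequence longer than max_len."""
    return [(n, len(s)) for n, s in read_fasta(lines) if len(s) > max_len]
```

For a file on disk, something like `too_long(open("translated.fa"))` would list the offending sequences; the file name is just a placeholder.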

The fix is to translate to individual ORF sequences. (Note that * is not a legal IUPAC amino acid residue character either, so it's best not to leave any in your sequences.)
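As an illustration of what "translate to individual ORF sequences" can look like, here is a minimal sketch that translates a DNA string in all six frames and splits each translation at stop codons, so that no output sequence contains `*`. It assumes the standard genetic code, treats an "ORF" as any stop-to-stop stretch (no start codon required), and uses an arbitrary minimum length of 20 residues; the function names are made up for this example. In practice a dedicated tool is probably the better choice (the Easel library bundled with HMMER includes an `esl-translate` miniapp, for instance).

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string; non-ACGT characters are kept."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=20):
    """Yield (name, protein) for each stop-free segment in all six frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Splitting on '*' removes stop characters from the output entirely.
            for i, orf in enumerate(protein.split("*")):
                if len(orf) >= min_len:
                    yield f"strand{strand}_frame{frame}_orf{i}", orf
```

Each yielded pair can be written out as one FASTA record, which gives hmmsearch many short, protein-sized sequences instead of six genome-length ones.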

A while ago I wrote a blog post with more information about this.

hmmsearch doesn't take long translated sequences as input #244