EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
307 stars 69 forks source link

hmmsearch doesn't take long translated sequences as input #244

Closed fannyhb closed 2 years ago

fannyhb commented 3 years ago

Hi,

previously I have been able to translate long sequences (i.e. genomes) with transeq and then feed them to hmmsearch. Transeq generates a star (*) at the end of translation, and then the same sequence continues (under the same header). Now, it seems like hmmsearch no longer can handle this kind of input, and this error message is given:

Fatal exception (source file p7_pipeline.c, line 697): Target sequence length > 100K, over comparison pipeline limit. (Did you mean to use nhmmer/nhmmscan?)

Is this a feature or a bug? Or have I got it wrong and it should be able to handle this kind of input?

best regards, Fanny

cryptogenomicon commented 3 years ago

That has actually never worked, so your previous results using this strategy may not be correct. HMMER uses a probability model of a complete protein sequence, so it expects sequences to be individual proteins. (For one reason why: HMMER's trying to tell you which proteins are homologous, and if you only give it 6 superlong "proteins" for a complete genome or chromosome, output telling you which of the six frames contains homologs isn't super useful.) Longest known proteins are about 30K-40K long. Additionally, because of some numerical stability issues with small transition probabilities in profile HMMs, our dynamic programming algorithms are only guaranteed to work up to a max of 100K.

Recently we added an explicit check that people weren't giving the sort of inputs you are, and that's the error you're getting.

The fix is to translate to individual ORF sequences. (Note that * is not a legal IUPAC amino acid residue character either, so best not to use those.)

A while ago I wrote a blog post with more information about this.