strange behavior using `hmmemit` from single-sequence HMM

snayfach commented 3 months ago

I noticed that `hmmemit` was generating very novel sequences from my HMM, so I checked the behavior of `hmmemit -N 100 profile.hmm` for an HMM built from just one protein. I expected the emitted sequences to be identical to the seed, since all the probabilities should be 1.0. Strangely, the emitted sequences show on average 30% identity to the single seed sequence.

Any help understanding this behavior would be great. Thank you.

cryptogenomicon commented 3 months ago

Yes, that's expected behavior. The profile HMM we build is a statistical model of the expected remote homologs of the input MSA (or single sequence), not just of the input itself. Counts/frequencies from the input are extrapolated to remote homologs in two ways: by mixture Dirichlet priors, and by an ad hoc entropy weighting technique. You can turn both of these off with `hmmbuild --pnone --enone`, and models of single sequences will then behave as you expected.
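As a rough illustration of why a prior alone pulls emitted sequences away from the seed, here is a minimal Python sketch. It is a toy, not HMMER's implementation: it assumes a single symmetric Dirichlet prior (HMMER uses a mixture), the pseudocount value `alpha = 0.1` and the seed sequence are made up, entropy weighting is not modeled, and only match emissions are sampled (no insertions or deletions).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(observed, alpha):
    """Posterior mean emission probabilities for one match column.

    One residue was observed; `alpha` is a hypothetical symmetric
    Dirichlet pseudocount added to every residue (alpha=0 means no prior).
    """
    total = 1.0 + alpha * len(AMINO_ACIDS)
    return {
        aa: ((1.0 if aa == observed else 0.0) + alpha) / total
        for aa in AMINO_ACIDS
    }

def emit(seed, alpha, rng):
    """Sample one sequence, column by column, from the toy model."""
    out = []
    for residue in seed:
        probs = emission_probs(residue, alpha)
        weights = [probs[aa] for aa in AMINO_ACIDS]
        out.append(rng.choices(AMINO_ACIDS, weights=weights)[0])
    return "".join(out)

def identity(a, b):
    """Fraction of aligned positions that match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
seed = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up seed sequence

# Without a prior, every column has probability 1.0 on the seed residue.
print(identity(seed, emit(seed, 0.0, rng)))  # 1.0

# With a prior, the seed residue's probability drops below 1.0,
# so emitted sequences diverge from the seed.
print(emission_probs("M", 0.1)["M"])
print(identity(seed, emit(seed, 0.1, rng)))
```

With `alpha = 0.0` the model is the analogue of `--pnone`: the emitted sequence reproduces the seed exactly. Any positive pseudocount spreads probability over the other residues, which is the effect described above.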

For multiple sequence alignments, there is one additional way that counts are altered: relative sequence weighting, which downweights closely related sequences. To shut that off too, and get models that strictly reflect the frequencies of the input MSA (I'm pretty sure; I don't think I forgot any other flags), use `hmmbuild --pnone --enone --wnone`.
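To make the effect of relative sequence weighting concrete, here is a small Python sketch under stated assumptions: it uses a Henikoff-style position-based weighting scheme purely for illustration, the toy alignment is made up and ungapped, and the function names are hypothetical. It is not meant to reproduce what `hmmbuild` computes internally.

```python
from collections import Counter

def position_based_weights(msa):
    """Henikoff-style position-based weights for an ungapped alignment.

    In each column, a sequence earns 1 / (k * n), where k is the number of
    distinct residues in the column and n is how many sequences share this
    sequence's residue. Weights are rescaled to sum to the sequence count.
    """
    n_seqs = len(msa)
    weights = [0.0] * n_seqs
    for column in zip(*msa):
        counts = Counter(column)
        k = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w * n_seqs / total for w in weights]

def column_frequencies(msa, col_index, weights):
    """Weighted residue frequencies in one alignment column."""
    freq = Counter()
    for seq, w in zip(msa, weights):
        freq[seq[col_index]] += w
    total = sum(weights)
    return {aa: c / total for aa, c in freq.items()}

# Three identical sequences plus one divergent sequence (made-up data).
msa = ["ACDE", "ACDE", "ACDE", "GCKE"]

uniform = [1.0] * len(msa)             # analogue of no weighting
weighted = position_based_weights(msa)

print(column_frequencies(msa, 0, uniform))   # raw frequencies
print(column_frequencies(msa, 0, weighted))  # redundant sequences downweighted
```

In this toy case the three identical sequences share their weight, so the divergent sequence's residue `G` counts for more than its raw 1-in-4 frequency in the first column. With uniform weights, the column frequencies are exactly the input frequencies.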

snayfach commented 3 months ago

Using `hmmbuild --pnone --enone` resulted in the expected behavior. Thanks for the quick reply.

EddyRivasLab / hmmer

strange behavior using `hmmemit` from single-sequence HMM #329