Closed snayfach closed 3 months ago
Yes, that's expected behavior. The profile HMM we build is a statistical model of the expected remote homologs of the input MSA (or single sequence), not just of the input. Counts/frequencies of the input are extrapolated to remote homologs in two ways: by mixture Dirichlet priors, and by an ad hoc entropy weighting technique. You can turn both of these off with hmmbuild --pnone --enone, and now models of single sequences will behave as you expected.
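To make the effect of the prior concrete, here is a minimal sketch in Python of how a pseudocount prior pulls a single observed residue away from probability 1.0. It is not HMMER's code: it assumes a single symmetric Dirichlet component with a made-up alpha of 0.1 per residue, whereas hmmbuild uses a mixture of Dirichlet components. The `posterior_mean` helper and the down-scaled count used as a crude stand-in for entropy weighting are illustrative assumptions too.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def posterior_mean(counts, alpha):
    """Posterior mean emission probabilities under one Dirichlet prior."""
    total = sum(counts.values()) + sum(alpha.values())
    return {a: (counts.get(a, 0.0) + alpha[a]) / total for a in alpha}

# Hypothetical symmetric prior; real mixture Dirichlet priors are not uniform.
alpha = {a: 0.1 for a in AMINO_ACIDS}
no_prior = {a: 0.0 for a in AMINO_ACIDS}

# One match column of a single-sequence model: the seed has W here, seen once.
counts = {"W": 1.0}

with_prior = posterior_mean(counts, alpha)
without_prior = posterior_mean(counts, no_prior)

# Crude stand-in for entropy weighting: shrink the observed count, which lets
# the prior dominate even more. The factor 0.5 is arbitrary.
downweighted = posterior_mean({"W": 0.5}, alpha)

print(f"P(W), no prior:              {without_prior['W']:.3f}")
print(f"P(W), with prior:            {with_prior['W']:.3f}")
print(f"P(W), prior + scaled count:  {downweighted['W']:.3f}")
```

With the prior in place, the seed residue is only the most likely emission at that position rather than the only one, so sequences sampled from the model drift away from the seed.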
For multiple sequence alignments, there is one additional way that counts are altered, which is relative sequence weighting (to downweight closely related sequences). To shut that off too, and get models that strictly reflect the frequencies of the input MSA (I'm pretty sure; I don't think I forgot any other flags), use hmmbuild --pnone --enone --wnone.
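As a rough illustration of what relative sequence weighting does to the counts, here is a sketch of Henikoff-style position-based weights in Python. It is a simplified version written for this example, not HMMER's implementation. The `position_based_weights` and `column_frequencies` helpers, the toy alignment, and the choice to normalize the weights so they sum to the number of sequences are all assumptions made here.

```python
from collections import Counter

def position_based_weights(msa):
    """Henikoff-style weights. In each column a sequence earns 1 / (k * n),
    where k is the number of distinct residues in the column and n is the
    number of sequences sharing this sequence's residue. Gaps are skipped."""
    weights = [0.0] * len(msa)
    for col in zip(*msa):
        freq = Counter(r for r in col if r != "-")
        k = len(freq)
        for i, r in enumerate(col):
            if r != "-":
                weights[i] += 1.0 / (k * freq[r])
    total = sum(weights)
    # Normalize so the weights sum to the number of sequences.
    return [w * len(msa) / total for w in weights]

def column_frequencies(msa, col_index, weights):
    """Weighted residue frequencies for one alignment column."""
    totals = Counter()
    for seq, w in zip(msa, weights):
        if seq[col_index] != "-":
            totals[seq[col_index]] += w
    norm = sum(totals.values())
    return {r: c / norm for r, c in totals.items()}

# Toy alignment: three identical sequences and one divergent one.
msa = ["ACDEF", "ACDEF", "ACDEF", "GHIKL"]

weights = position_based_weights(msa)
unweighted = column_frequencies(msa, 0, [1.0] * len(msa))
weighted = column_frequencies(msa, 0, weights)

print("weights:             ", [round(w, 3) for w in weights])
print("column 0, unweighted:", unweighted)
print("column 0, weighted:  ", weighted)
```

The three identical sequences end up sharing roughly the weight of one sequence, so the first column's frequencies shift from the raw 3:1 ratio toward an even split. Turning weighting off keeps the raw ratio.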
Using hmmbuild --pnone --enone resulted in the expected behavior. Thanks for the quick reply.
I noticed that hmmemit was generating very novel sequences from my HMM. So I decided to check the behavior of hmmemit -N 100 profile.hmm for an HMM built from just one protein, expecting the emitted sequences to be identical to the seed, since all the probabilities should be 1.0. Strangely, the emitted sequences show on average 30% identity to the single seed sequence.
Any help understanding this behavior would be great. Thank you.