evoldoers / machineboss

Bioinformatics Open Source Sequence machine
BSD 3-Clause "New" or "Revised" License

Implement HMMER's handling of X's (for protein) and N's (for DNA) #117

Open ihh opened 4 years ago

ihh commented 4 years ago

HMMER weights IUPAC degenerate emissions using the reciprocal of the perplexity of the underlying match state (see the esl_abc_FExpectScore function in the HMMER3 source).

This has the effect that the "score" for those emissions is the expectation of what you'd get if you randomized the X's using the underlying emission distribution. That was much to the chagrin of Roger Sewell, who argued they should be treated as missing data; Sean's counterargument is that this would reward their alignment to the model. This is an old argument.
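To make the "expected score" idea concrete, here is a minimal sketch in Python. It assumes the score for a degenerate symbol is the probability-weighted mean of the per-residue scores over the residues that symbol can stand for. The names `IUPAC_DNA` and `expected_score` are hypothetical, the degeneracy table is only a subset, and none of this is taken from the machineboss or Easel source.

```python
# Hypothetical subset of the IUPAC DNA degeneracy codes (illustration only).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}


def expected_score(symbol, scores, probs, degeneracy=IUPAC_DNA):
    """Return the probability-weighted mean score for `symbol`.

    scores: dict mapping each canonical residue to its score
    probs:  dict mapping each canonical residue to its weight; which
            distribution to use here is an assumption left to the caller
    """
    residues = degeneracy[symbol]
    total = sum(probs[r] for r in residues)
    if total <= 0.0:
        raise ValueError("no probability mass on residues for " + symbol)
    # Renormalize over the residues that the symbol can represent.
    return sum(probs[r] * scores[r] for r in residues) / total


if __name__ == "__main__":
    scores = {"A": 2.0, "C": -1.0, "G": 0.0, "T": -1.0}
    probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    for sym in ("A", "R", "N"):
        print(sym, expected_score(sym, scores, probs))
```

For a canonical residue the function simply returns that residue's score. By contrast, a missing-data treatment would marginalize over the unknown residue, which for a normalized emission distribution contributes a log-probability of 0.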

Practically (as noted by @jordisr) this affects <1% of sequences, but for full HMMER compatibility we ought to include it.