fmicompbio / monaLisa

binned motif enrichment analysis and visualisation
https://fmicompbio.github.io/monaLisa/
GNU General Public License v3.0
36 stars 6 forks source link

The nature of the Markov models #66

Closed stianale closed 1 year ago

stianale commented 1 year ago

Hi,

I was simply wondering if you can provide me with information/literature on how the Markov models used in the getKmerFreq() function works, as I do not have a good understanding of these models. I could not find much information on this in the man pages.

Thanks, stianale

mbstadler commented 1 year ago

Hi @stianale

I think you will find a solid discussion in "Biological Sequence Analysis" (Cambridge University Press, Durbin R. et al) https://doi.org/10.1017/CBO9780511790492 (in particular chapter 3, although this book goes beyond Markov models and also discusses hidden Markov models in the same chapter).

Very briefly, the idea is that we calculate an expected count for a k-mer word using the probabilities of shorter sequence words. For example, in a zero-order Markov model, the probability of AAT p(AAT) would be the product of probabilities for A, A and T: p(A) p(A) p(T), which are estimated from the mono-nucleotide frequencies in the same set of sequences.

For higher order Markov model (say a model of order i), the probability of nucleotides depend on the previous i nucleotides, so for i=1 p(AAT) becomes p(A|start) p(A|A) p(T|A). This can be thought of as a model that is based on di-nucleotide and mono-nucleotide frequencies. Similarly, a model of order 2 (i=2) would be modeling tri-mers based on di-nucleotide frequencies, etc. In monaLisa::getKmerFreq(), we are estimating for example p(T|A) = frequency_AT / frequency_A. For numerical stability, this is done in log-space (variables lp_long and lp_short).

Finally, the expected frequency of a k-mer word (e.g. AAT) would be the probablity of observing the word at a given position, multiplied with the number of positions.

If you require more details, I am sure you will find them in the above book chapter.

I hope that answers you question. I will close the issue for the moment, but please feel free to re-open it if necessary.