TehMillhouse / PyMarkovChain

Simple Markov chain implementation in Python

Add ability to do an nth order Markov chain #9

Closed · asmeurer closed this 10 years ago

asmeurer commented 10 years ago

If I understand the code correctly, this is just a first-order Markov chain. So for instance, given the sentence "The sky is blue", it maps "The" -> "sky", "sky" -> "is", and "is" -> "blue". A higher-order chain would also consider "The sky" -> "is", "sky is" -> "blue", and so on. A keyword argument to specify the order would probably be best, as there are likely disadvantages to making the order too high.
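
For illustration, here's a toy sketch of the difference between the two transition tables (hypothetical code, not PyMarkovChain's actual internals):

```python
# Toy sketch: first- vs. second-order transition tables.
words = "The sky is blue".split()

first_order = {}   # "sky" -> ["is"], ...
second_order = {}  # ("The", "sky") -> ["is"], ...

for i in range(len(words) - 1):
    first_order.setdefault(words[i], []).append(words[i + 1])
for i in range(len(words) - 2):
    second_order.setdefault((words[i], words[i + 1]), []).append(words[i + 2])

print(first_order)   # {'The': ['sky'], 'sky': ['is'], 'is': ['blue']}
print(second_order)  # {('The', 'sky'): ['is'], ('sky', 'is'): ['blue']}
```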

TehMillhouse commented 10 years ago

Correct, this is just a first-order Markov chain. However, extending it to an n-th order chain is not as simple as just changing the way words are split: a failed lookup for "The sky" should fall back to the next best thing, guessing solely based on "sky". (So a "correct" n-th order Markov chain needs to save all the information an (n-1)-th order chain needs, and then some.) It's not terribly hard to implement, it's just something I haven't gotten around to doing :)
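
Building such a database might look roughly like this (a hypothetical sketch, not this repo's code; the point is that keys of every length from 1 up to n get stored, so shorter lookups can serve as fallbacks):

```python
# Hypothetical sketch: store keys of every length 1..n so that a
# failed lookup on an n-word key can fall back to a shorter key.
def build_database(text, n=2):
    db = {}
    for sentence in text.split("\n"):
        words = sentence.split()
        for order in range(1, n + 1):
            for i in range(len(words) - order):
                key = tuple(words[i:i + order])
                db.setdefault(key, []).append(words[i + order])
    return db
```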

TehMillhouse commented 10 years ago

If you're specifically looking for an n-th order Markov chain and don't want to wait until I get around to this (or do it yourself), you might want to check out the implementation at https://github.com/Kha/err-babble-bot/blob/master/markov.py
I don't know how nice it is to use, but I've been told it's not so specialized as to be useless outside of that repo.

asmeurer commented 10 years ago

Depending on how bored I get, I might do it. Could be fun to learn more about Markov chains. Is there a good theoretical reference? Just Wikipedia?

asmeurer commented 10 years ago

What is the correct way to "fall back"? Should it first try n, and if there are no matches, then n - 1, and so on? Or should it lump them all together somehow (and if so, should longer strings get higher probabilities than shorter ones)?

asmeurer commented 10 years ago

I have started this at https://github.com/TehMillhouse/PyMarkovChain/pull/11.

I haven't written the text generation part yet, because I'm not sure how best to do it (see the previous comment). Is there a mathematically "official" way to do it, based on Markov theory?

I played with the err-babble implementation on my data. It seemed to generate only sentences taken directly from the source text; I guess it places too much weight on very long strings.

asmeurer commented 10 years ago

I went with the "first try n, then if there are no matches n - 1, and so on" strategy.
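
Roughly like this (a sketch assuming the hypothetical build_database from the earlier comment, i.e. tuple keys mapping to lists of candidate next words; next_word is a made-up helper, not the actual PR code):

```python
import random

# Backoff: try the last n words as the key, then n - 1, ..., then 1.
def next_word(db, generated, n=2):
    for order in range(min(n, len(generated)), 0, -1):
        candidates = db.get(tuple(generated[-order:]))
        if candidates:
            return random.choice(candidates)
    return None  # no match at any order; end the sentence

db = build_database("The sky is blue\nThe sea is blue")
sentence = ["The"]
word = next_word(db, sentence)
while word is not None:
    sentence.append(word)
    word = next_word(db, sentence)
print(" ".join(sentence))
```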

asmeurer commented 10 years ago

By the way, n=2 gives a huge increase in comprehensibility over n=1. I'm skeptical about trying anything larger than that; it will probably just start giving me source sentences verbatim, especially given the strategy I'm using.

TehMillhouse commented 10 years ago

As for the "official" way: the only defining characteristic of a Markov chain is that its next state depends only on the current state (or, for an n-th order chain, on the last n states), not on the rest of the history. Apart from that it's pretty much fair game.
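
In symbols, for an n-th order chain over a word sequence X_1, X_2, ..., the defining property is just:

```
P(X_t | X_{t-1}, X_{t-2}, ..., X_1) = P(X_t | X_{t-1}, ..., X_{t-n})
```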

And yeah, depending on your source corpus, your goals, and the variability of the sentences you fed into the chain, high values of n won't give you much. The trick is striking a balance; depending on how large your chat log is, I'd guess n=2 or n=3 might be a good tradeoff.

asmeurer commented 10 years ago

Well, I'm pretty happy with what my implementation is giving me (n=2, trying order 2 first and falling back to order 1 only when there's no match).