Closed asmeurer closed 10 years ago
Correct, this is just a first-order Markov chain. However, extending it to an n-th order chain is not quite as simple as changing the way words are split, since a failing lookup for "The sky" should fall back to the next best thing: guessing solely based on "sky". (So a "correct" n-th order Markov chain needs to save all the information that an (n-1)-th order chain needs, and then some.) It's not impossibly hard to implement, it's just something I haven't gotten around to doing :)
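A minimal sketch of what "saving all the information a lower-order chain needs" could look like — all names here are made up for illustration, this isn't the package's actual code:

```python
from collections import defaultdict

def build_model(text, order=2):
    """Store successor counts for every prefix length 1..order,
    so a failed lookup can fall back to a shorter prefix."""
    words = text.split()
    model = defaultdict(lambda: defaultdict(int))
    for n in range(1, order + 1):
        for i in range(len(words) - n):
            prefix = tuple(words[i:i + n])
            model[prefix][words[i + n]] += 1
    return model

model = build_model("The sky is blue", order=2)
# Both ("The", "sky") and ("sky",) are keys, so a failed
# two-word lookup can fall back to the last word alone.
```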
If you're specifically looking for an n-th order markov chain and don't want to wait until I get around to doing this (or doing it yourself), you might want to check out the implementation at
https://github.com/Kha/err-babble-bot/blob/master/markov.py
I don't know how nice it is to use, but I've been told it's not so specialized as to be useless outside of that repo.
Depending on how bored I get, I might do it. Could be fun to learn more about Markov chains. Is there a good theoretical reference? Just Wikipedia?
What is the correct way to "fall back"? Should it first try n, and if there are no matches, then n - 1, and so on, or should it lump them all together somehow (and if so, should it somehow give the longer strings higher probabilities than the lower ones)?
I have started this at https://github.com/TehMillhouse/PyMarkovChain/pull/11.
I haven't yet written the text generation part, because I am not sure how to best do it (see the previous comment). Is there a mathematically "official" way to do it, based on the Markov theory?
I played with the err-babble thing with my data. It seemed to generate only sentences that were directly from the source text. I guess it placed too high of an emphasis on very long order strings.
I went with the "first try n, then if there are no matches n - 1, and so on" strategy.
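That backoff strategy could be sketched roughly like this — hypothetical names, assuming a model keyed by word tuples with successor counts as values:

```python
import random

def next_word(model, recent, order=2):
    """Try the longest available prefix first; on a failed
    lookup, drop the oldest word and retry with a shorter one."""
    for n in range(min(order, len(recent)), 0, -1):
        prefix = tuple(recent[-n:])
        if prefix in model:
            successors = model[prefix]
            words = list(successors)
            weights = [successors[w] for w in words]
            # Sample proportionally to how often each word followed
            return random.choices(words, weights=weights)[0]
    return None  # no prefix of any length matched
```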
By the way, n=2 appears to give a huge increase in comprehensibility over n=1. I'm skeptical whether I should even try larger n than that. It will probably just start giving me source sentences, especially given the strategy I am using.
As for the "official" way, the only defining characteristic of Markov chains is that their future states depend only on past states (modulo non-determinism). Apart from that it's pretty much fair game.
And yeah, depending on your source corpus, your goals and the variability of the sentences you fed into the chain, high values of n won't give you much. The trick's striking a balance, and I'd guess depending on how large your chat log is, n=2 or n=3 might be a good tradeoff.
Well, I'm pretty happy with what my implementation is giving me (n=2, trying two-word prefixes first and falling back to one-word prefixes only when there is no match).
If I understand the code for this correctly, this is just a first-order Markov chain. So for instance, if you have the sentence "The sky is blue", it maps "The" -> "sky", "sky" -> "is", and "is" -> "blue". A higher-order one would also consider "The sky" -> "is", "sky is" -> "blue", and so on. A keyword argument to specify the order would probably be best, as there are probably disadvantages to setting the order too high.
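For that one sentence, the first-order mapping described above could be built like this (a minimal sketch for illustration, not the package's actual code):

```python
def first_order_pairs(sentence):
    """Map each word to the single word that follows it
    (a first-order chain built from one sentence)."""
    words = sentence.split()
    return {a: b for a, b in zip(words, words[1:])}

first_order_pairs("The sky is blue")
# {"The": "sky", "sky": "is", "is": "blue"}
```

A real model would map each word to a distribution over successors rather than a single word, but the idea is the same.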