Open TehMillhouse opened 11 years ago
This would require the Markov chain to defer calculating the occurrence probabilities until text generation, but that should be quite doable.
Not really. That would slow down text generation, and you probably want to keep the slow code paths in the db-generation step.
I suggest doing something a bit different: store the word counts AND the calculated probabilities in the db. Then, when adding more samples, simply add to those counts and use the result to recalculate the probabilities.
I think it'll be simpler to implement too.
Every time new information is added to the db, all probabilities shift as a result. If my current db (for any single word as last state) is `{lol: 0.5, lol_wordcount: 3, haha: 0.5, haha_wordcount: 3}`, and I encounter `rofl`, not only will `rofl` and `rofl_wordcount` be added, but all other probabilities will change too. So at that time, I either recalculate all probabilities, or defer this calculation until it's needed.
All I currently need for word selection (see `_nextWord`) is that all candidates have a corresponding value, the sum of which is the upper bound of my randomly selected sample. I don't ever really need floating-point numbers for that; I've just used them in my implementation because it's easier to reason about probabilities when they fall within the mathematical definition as numbers in the interval [0, 1].
P.S. I think I just spotted a mathematical flaw in word selection, I'll open an issue as soon as I'm sure of it. Never mind.
Just to be clear: if this is done the right way, db generation would be a mere matter of counting words, and `_nextWord` wouldn't change except for exchanging the call to `random.random()` with `random.randrange(self.wordcounttotals[lastword])` (where `self.wordcounttotals[lastword]` is the total number of times `lastword` was followed by another word).
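Put together, the counting-based variant could look like this. This is a hypothetical sketch: `_nextWord` and `wordcounttotals` are the names used in this thread, and everything else is assumed.

```python
import random

class MarkovChain:
    def __init__(self):
        # db[lastword] maps each successor word to the number of times
        # it followed lastword; wordcounttotals caches the row sums.
        self.db = {}
        self.wordcounttotals = {}

    def add_sample(self, lastword, word):
        # db generation is a mere matter of counting words.
        followers = self.db.setdefault(lastword, {})
        followers[word] = followers.get(word, 0) + 1
        self.wordcounttotals[lastword] = self.wordcounttotals.get(lastword, 0) + 1

    def _nextWord(self, lastword):
        # Integer math: draw an index below the total count, then walk
        # the candidates until the cumulative count passes it.
        sample = random.randrange(self.wordcounttotals[lastword])
        for word, count in self.db[lastword].items():
            sample -= count
            if sample < 0:
                return word
```

A candidate with count `n` is then returned with probability `n / wordcounttotals[lastword]`, matching the floating-point version but without rounding error.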
Is there an update on this? It would be really useful for a project we're working on.
Sorry about the late answer. I'm quite busy otherwise at the moment, so it's unlikely I'll get around to implementing this in the near future.
> This would require the Markov chain to defer calculating the occurrence probabilities until text generation, but that should be quite doable.
Also, switching the `_nextWord` function over to integer math will do away with rounding errors and will improve performance. Yay!