Open TehMillhouse opened 11 years ago
This would require the Markov chain to defer calculating the occurrence probabilities until text generation, but that should be quite doable.
Not really. That would slow down text generation, and you probably want to keep the slow code paths in the db-generation step.
I suggest doing something a bit different: store the word counts AND the calculated probabilities in the db. Then, when adding more samples, simply add to those counts and use the result to recalculate the probabilities.
I think it'll be simpler to implement too.
Every time new information is added to the db, all probabilities shift as a result. If my current db (for any single word as last state) is `{lol: 0.5, lol_wordcount: 3, haha: 0.5, haha_wordcount: 3}`, and I encounter `rofl`, not only will `rofl` and `rofl_wordcount` be added, but all other probabilities will change too. So at that time, I either recalculate all probabilities, or defer this calculation until it's needed.
All I currently need for word selection (see `_nextWord`) is that all candidates have a corresponding value, the sum of which is the upper bound of my randomly selected sample. I don't ever really need floating-point numbers for that; I've just used them in my implementation because it's easier to reason about probabilities when they fall within the mathematical definition as numbers in the interval [0, 1].
P.S. I think I just spotted a mathematical flaw in word selection, I'll open an issue as soon as I'm sure of it. Never mind.
Just to be clear: if this is done the right way, db generation would be a mere matter of counting words, and `_nextWord` wouldn't change except for exchanging the call to `random.random()` with `random.randrange(self.wordcounttotals[lastword])` (where `self.wordcounttotals[lastword]` is the total number of times `lastword` was followed by another word).
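Put together, the counting-based variant could look like this. This is a hypothetical sketch: `_nextWord` and `wordcounttotals` are the names used in this thread, and everything else is assumed.

```python
import random

class MarkovChain:
    def __init__(self):
        # db[lastword] maps each successor word to the number of times
        # it followed lastword; wordcounttotals caches the row sums.
        self.db = {}
        self.wordcounttotals = {}

    def add_sample(self, lastword, word):
        # db generation is a mere matter of counting words.
        followers = self.db.setdefault(lastword, {})
        followers[word] = followers.get(word, 0) + 1
        self.wordcounttotals[lastword] = self.wordcounttotals.get(lastword, 0) + 1

    def _nextWord(self, lastword):
        # Integer math: draw an index below the total count, then walk
        # the candidates until the cumulative count passes it.
        sample = random.randrange(self.wordcounttotals[lastword])
        for word, count in self.db[lastword].items():
            sample -= count
            if sample < 0:
                return word
```

A candidate with count `n` is then returned with probability `n / wordcounttotals[lastword]`, matching the floating-point version but without rounding error.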
Is there an update on this? It would be really useful for a project we're working on.
Sorry about the late answer. I'm quite busy otherwise at the moment, so it's unlikely I'll get around to implementing this in the near future.
> This would require the Markov chain to defer calculating the occurrence probabilities until text generation, but that should be quite doable.
Also, switching the `_nextWord` function over to integer math will do away with rounding errors and will improve performance. Yay!