TehMillhouse / PyMarkovChain

Simple Markov chain implementation in Python

If the input data has extremely long lines, generateString can cause a RuntimeError #1

Closed: lorenzhs closed this issue 11 years ago

lorenzhs commented 11 years ago

I used a book from Project Gutenberg as the input file. A line in the input file corresponds to a paragraph in the book, and since PyMarkovChain uses lines as its delimiting unit, calling generateString either produces extremely long text or a RuntimeError, because the recursion depth is exceeded in _accumulateWithSeed:

>>> print a.generateString()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "MarkovChain.py", line 76, in generateString
    return self._accumulateWithSeed("", "")
  File "MarkovChain.py", line 103, in _accumulateWithSeed
    return self._accumulateWithSeed(sentence + sep + lastWord, nextWord)
[...]
  File "MarkovChain.py", line 103, in _accumulateWithSeed
    return self._accumulateWithSeed(sentence + sep + lastWord, nextWord)
  File "MarkovChain.py", line 96, in _accumulateWithSeed
    nextWord = self._nextWord(lastWord)
  File "MarkovChain.py", line 114, in _nextWord
    if probmap[candidate] > maxprob:
RuntimeError: maximum recursion depth exceeded in cmp
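For reference, CPython caps the call stack at roughly a thousand frames by default, and with one recursive _accumulateWithSeed call per generated word, a long enough paragraph blows right through it:

    import sys

    # CPython's default recursion limit is about 1000 frames; since
    # _accumulateWithSeed recurses once per word, a chain that keeps
    # producing words past that depth raises the error above.
    print(sys.getrecursionlimit())  # typically 1000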
lorenzhs commented 11 years ago

One possibility would be to use nltk to split the text into sentences (and, if you're already pulling it in, to split sentences into words a little more cleverly), though it would be a huge dependency and kind of overkill. Alternatively, you could try to accomplish something similar with regular expressions.
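For illustration, the nltk route would look roughly like this (the file name is a placeholder, and this is only a sketch, not an endorsement of the dependency):

    import nltk

    nltk.download('punkt')  # one-time tokenizer model download

    # 'book.txt' stands in for the Gutenberg input file.
    with open('book.txt') as f:
        text = f.read()

    for sentence in nltk.sent_tokenize(text):
        words = nltk.word_tokenize(sentence)  # smarter than str.split()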

TehMillhouse commented 11 years ago

Thanks for pointing this out. Can you post a link to the file you're trying to process? That way I know what I'm up against.

Yes, currently I assume the input text has \n characters between "sentences", so for your current application the quickest fix would be to run sed 's/\./&\n/g' over the file and then pipe it into the Markov chain. I think the way to go is to let you specify a sentence delimiter, with \n only as the default. While I'm at it, I'll convert the code from str.split() to regular expressions, which will let the preprocessing run with the memory efficiency of a generator.
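Something like this minimal sketch is what I have in mind (the function name and default are placeholders, not the final API):

    import re

    def split_sentences(text, delimiter=r'\n'):
        # Lazily yield one sentence at a time instead of building the
        # whole list up front, the way re.split() would.
        pos = 0
        for match in re.finditer(delimiter, text):
            sentence = text[pos:match.start()].strip()
            pos = match.end()
            if sentence:
                yield sentence
        tail = text[pos:].strip()
        if tail:
            yield tail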

I really want to avoid pulling in any dependencies for this thing; it's supposed to be 'plug and play' (if that makes sense).

TehMillhouse commented 11 years ago

Also, having _accumulateWithSeed be recursive is probably not a good idea in general, as Python implementations usually don't do tail call optimization. I'll address this separately; see #2.
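Roughly, the recursion would become a loop like the sketch below; the termination test is an assumption about _nextWord's contract, not the actual code:

    def _accumulateWithSeed(self, sentence, lastWord, sep=" "):
        # Iterative sketch: one loop pass per former recursive call, so
        # the stack stays flat however long the sentence grows.
        while True:
            nextWord = self._nextWord(lastWord)
            sentence = (sentence + sep + lastWord).strip()
            if not nextWord:  # assumed end-of-sentence sentinel
                return sentence
            lastWord = nextWord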

lorenzhs commented 11 years ago

Sure! I couldn't find it online anymore, so here's a paste: http://bpaste.net/show/OanxZ3ZUoDjCTQdZLxOn/

Yes, using nltk would hurt ease of use tremendously; I completely understand that you don't want to use it, and it doesn't really make sense here anyway. Some way of specifying how to split text into sentences and words seems like the way to go to me, too.

Also, re.finditer() looks good. I don't really know how re.split() is implemented.

TehMillhouse commented 11 years ago

Ah, nice. I was thinking of building something around re.search(), but re.finditer() is perfect, thanks. I'll look into it right away.

EDIT: Oh, I forgot to mention: re.split() returns a list, not an iterator, so it constructs the entire list in memory, which is exactly what I'm trying to avoid.
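Concretely, the difference looks like this (toy input):

    import re

    text = "First. Second. Third."
    re.split(r'\. ', text)     # ['First', 'Second', 'Third.'], full list in memory
    re.finditer(r'\. ', text)  # iterator over matches, consumed lazily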