TehMillhouse / PyMarkovChain

Simple markov chain implementation in python
Other
97 stars 17 forks source link

Add a way to make words compare equal #10

Open asmeurer opened 10 years ago

asmeurer commented 10 years ago

I've got source text that has some word variations that I'd like to ignore. There are the basic case sensitivity issues. If it were just that I could str.lower everything, but I'm also dealing with source text with a lot of misspellings. I think I might be able to get around this by using some ratio function from difflib. It would be useful to be able to supply a function which would be used to compare two strings and if it returns True, consider them to be equal.

TehMillhouse commented 10 years ago

As it is right now, string generation is largely based on dictionary lookup, so as long as there's no such thing as "fuzzy dictionary lookup", it doesn't look like this will be a simple addition.

One thing you could do is classifying different words into equivalence classes (for example, by assuming the first occurrence of the word has the right spelling, and by seeing all words with Levenshtein distance <= 1 to this as other occurrences of the same word when counting), but as long as you don't keep around a pre-selected dictionary of properly-spelled words, there will be glitches when generating sentences (like when the first occurrence of a word is already misspelled, or when you encounter similarly-spelled words).

asmeurer commented 10 years ago

Yeah, getting the "right" word is a separate issue.

I think this can be done using some dict subclass. I guess you'd have to be careful about making it picklable, though. I might play with it. I also didn't think about the fact that two words might be close to a third word, but far from one another. One could theoretically chain words arbitrarily far from one another (like one of these).

Probably this is a much harder problem than I think it is. Another issue is things like the fact that "if" and "of" are really close but should remain separate.

My primary motivation here is to try to get more coherent sentences out of this. My source text is some Google Hangouts chat (https://github.com/asmeurer/markov), which has a lot of mispelling and autocorrectisms. But I'm now thinking that fixing https://github.com/TehMillhouse/PyMarkovChain/issues/9 is a better way to achieve this.

TehMillhouse commented 10 years ago

I'd reckon upgrading the markov chain to an nth order chain would already make it "good enough" for chatbot shenanigans.
We're using that err plugin I've linked to before for our chatbot and it routinely fools people. The misspellings do reduce variation, but only add to the illusion.

If you want more cohesive sentences than what a straight up markov chain can give you, you can use mispy's approach (explained in here, code (Ruby) here). For most things that are more sophisticated you'd have to properly parse human language, which doesn't seem like it's worth the effort for something like a chatbot.

asmeurer commented 10 years ago

Yeah, my goal is entirely humor value, not to fool anyone. Well that and I wanted to learn about markov chains.