LinuxMercedes / markovirc

Markov Chain IRC bot

Decapitalize words in database, use capitalization masks #18

Open brhoades opened 10 years ago

brhoades commented 10 years ago

In Marko's current implementation, every capitalization of a word is stored separately: "Brodes" is kept as a separate word from "brodes". This clutters the database, because variants of what is really the same word are stored as unrelated entries. Seeborg's (https://code.google.com/p/seeborg/) fix for this issue was simply to lowercase every incoming word and store it that way. However, in English and many other languages, removing capitalization removes meaning from a sentence.

I would like a separate table for capitalization bitmasks, called capmasks for short. For every word taken in, Marko would compute its capitalization as a mask, then either find that mask in capmasks or insert it. The chain would then store the capmask id as a foreign key and reference the wordid of the lowercase word, instead of one of the 20 different capitalization variants the word may have.
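A rough sketch of how that lookup could work, using Python and an in-memory SQLite database purely for illustration. The table and column names (`words`, `capmasks`, `chain`, `wordid`, `capmaskid`) and the helpers `get_or_create` and `store_word` are assumptions, not Marko's actual schema, and the real chain table would hold more than these two columns.

```python
import sqlite3

# Hypothetical, simplified schema: one row per lowercase word, one row per
# distinct mask, and chain rows that point at both.
SCHEMA = """
CREATE TABLE words    (id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
CREATE TABLE capmasks (id INTEGER PRIMARY KEY, mask TEXT UNIQUE NOT NULL);
CREATE TABLE chain (
    id        INTEGER PRIMARY KEY,
    wordid    INTEGER NOT NULL REFERENCES words(id),
    capmaskid INTEGER NOT NULL REFERENCES capmasks(id)
);
"""


def get_or_create(db, table, column, value):
    """Return the id of the row holding value, inserting it first if needed."""
    # table and column are fixed names from store_word, never user input.
    db.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    row = db.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    return row[0]


def store_word(db, word):
    """Store one incoming word as (lowercase wordid, capmaskid) in the chain."""
    # '1' marks an uppercase character, '0' anything else.
    mask = "".join("1" if ch.isupper() else "0" for ch in word)
    wordid = get_or_create(db, "words", "word", word.lower())
    capmaskid = get_or_create(db, "capmasks", "mask", mask)
    cur = db.execute(
        "INSERT INTO chain (wordid, capmaskid) VALUES (?, ?)",
        (wordid, capmaskid),
    )
    return cur.lastrowid


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
for w in ("Brodes", "brodes", "BRODES", "Brodes"):
    store_word(db, w)
```

With this layout the four inputs share a single `words` row, while only the three distinct masks are stored in `capmasks`.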

For example, the capitalization bitmask for "aBcDefGHI" would be 010100111, with a 1 in each position that holds an uppercase letter.
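The mask in this example could be computed, and reversed, along these lines. This is a minimal Python sketch: the function names `cap_mask` and `apply_mask` are made up for illustration, the mask is kept as a string of `0`/`1` characters rather than a packed integer, and it assumes lowercasing does not change the word's length (true for ASCII, not for every Unicode character).

```python
def cap_mask(word: str) -> str:
    """Return one flag per character: '1' if it is uppercase, else '0'."""
    return "".join("1" if ch.isupper() else "0" for ch in word)


def apply_mask(lowered: str, mask: str) -> str:
    """Rebuild the original capitalization from a lowercase word and its mask."""
    return "".join(
        ch.upper() if bit == "1" else ch
        for ch, bit in zip(lowered, mask)
    )


word = "aBcDefGHI"
mask = cap_mask(word)                      # "010100111"
restored = apply_mask(word.lower(), mask)  # "aBcDefGHI"
```

Storing the mask as text keeps leading zeros intact; an integer bitmask would need the word length stored alongside it to tell "brodes" (000000) apart from a shorter all-lowercase word.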