Create a "clean slate" place for our actual classifier

Put it on GitHub somewhere "fresh." Use existing code from any of the four of us as a guide, as necessary. Do the most vanilla thing:

Basic text cleaning (word tokenizing, removing most punctuation, stemming, stopwords, and other Reddit-specific stuff.)
Use techniques from Soliman paper as appropriate (bot detection, etc.)
Count all stems and keep only the n most frequent
Use only a binary "does the stem appear, or not, in document X?" approach
Throw it in Naive Bayes
Run a plain old 70/30 training/test split
Depress us all by giving us the accuracy number (Btw, for now let's just classify polar vs. non-polar, and ignore community.)
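
The steps above could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration under stated assumptions, not the project's actual code: `crude_stem` is a stand-in for a real stemmer such as Porter, `STOPWORDS` is a tiny placeholder list, the Reddit-specific cleaning is limited to stripping URLs and `u/`/`r/` mentions, and all function names are hypothetical. The Soliman-paper techniques (bot detection, etc.) are not covered here. The sketch also builds the vocabulary from the training portion only, which is one reasonable reading of "count all stems."

```python
import math
import random
import re
from collections import Counter

# Placeholder stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are",
             "to", "of", "in", "it", "this", "that"}

def crude_stem(word):
    # Stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    text = re.sub(r"https?://\S+", " ", text.lower())  # drop URLs
    text = re.sub(r"(?<!\w)/?[ur]/[\w-]+", " ", text)  # drop u/name, r/sub
    tokens = re.findall(r"[a-z]+", text)               # tokenize, no punctuation
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def top_stems(docs, n):
    # Count every stem occurrence and keep the n most frequent.
    counts = Counter(stem for doc in docs for stem in doc)
    return [stem for stem, _ in counts.most_common(n)]

def featurize(doc, vocab):
    # Binary feature: does the stem appear in this document, or not?
    present = set(doc)
    return [1 if stem in present else 0 for stem in vocab]

def train_nb(X, y):
    # Bernoulli Naive Bayes with add-one (Laplace) smoothing.
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[label] = (math.log(len(rows) / len(X)), probs)
    return model

def predict(model, x):
    def score(label):
        log_prior, probs = model[label]
        return log_prior + sum(
            math.log(p if bit else 1 - p) for bit, p in zip(x, probs)
        )
    return max(model, key=score)

def run(texts, labels, n=1000, seed=0):
    # Clean, shuffle, split 70/30, train, and return test-set accuracy.
    docs = [clean(t) for t in texts]
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    cut = int(0.7 * len(order))
    train, test = order[:cut], order[cut:]
    vocab = top_stems([docs[i] for i in train], n)
    X_train = [featurize(docs[i], vocab) for i in train]
    model = train_nb(X_train, [labels[i] for i in train])
    hits = sum(
        predict(model, featurize(docs[i], vocab)) == labels[i] for i in test
    )
    return hits / len(test)
```

Calling `run(texts, labels)` with a list of comment strings and a parallel list of polar / non-polar labels returns the accuracy on the held-out 30%. Swapping in an established stemmer and stopword list would only require changing `crude_stem` and `STOPWORDS`.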

TromboneDavies / PolarOps