TromboneDavies / PolarOps


Create a "clean slate" place for our actual classifier #30

Closed divilian closed 3 years ago

divilian commented 3 years ago

Put it on GitHub somewhere "fresh." Use existing code from any of the four of us as a guide, as necessary. Do the most vanilla thing:

  1. Basic text cleaning (word tokenizing, removing most punctuation, stemming, stopwords, and other Reddit-specific stuff).
  2. Use techniques from Soliman paper as appropriate (bot detection, etc.)
  3. Count all stems and keep only the n most frequent ones
  4. Use only binary "does the stem appear, or not, in document X?" approach
  5. Throw it in Naive Bayes
  6. Run a plain old 70/30% training/test split
  7. Depress us all by giving us the accuracy number (Btw, for now let's just classify polar vs. non-polar, and ignore community.)
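
For reference, steps 1 and 3 through 7 could look roughly like the sketch below. It is only an illustration under several assumptions: it uses the standard library alone, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter, the stopword list is a tiny placeholder, the only Reddit-specific cleanup is URL removal, and every function name here is made up for this example. Step 2 (the Soliman techniques) is left out because the thread gives no details on them.

```python
import math
import random
import re
from collections import Counter

# Placeholder stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in", "it"}

def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Step 1: lowercase, drop URLs and punctuation, tokenize, remove stopwords, stem."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def top_stems(docs, n):
    """Step 3: count every stem occurrence and keep the n most frequent stems."""
    counts = Counter(stem for doc in docs for stem in doc)
    return [stem for stem, _ in counts.most_common(n)]

def train_nb(docs, labels, vocab):
    """Steps 4-5: Bernoulli Naive Bayes over binary presence, Laplace-smoothed."""
    model = {}
    for c in sorted(set(labels)):
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        prior = math.log(len(class_docs) / len(docs))
        probs = {s: (sum(s in d for d in class_docs) + 1) / (len(class_docs) + 2)
                 for s in vocab}
        model[c] = (prior, probs)
    return model

def predict(model, doc):
    """Return the class with the highest log-posterior for one cleaned document."""
    present = set(doc)
    def score(c):
        prior, probs = model[c]
        return prior + sum(math.log(p if s in present else 1 - p)
                           for s, p in probs.items())
    return max(model, key=score)

def evaluate(texts, labels, n=1000, train_frac=0.7, seed=0):
    """Steps 6-7: shuffled 70/30 split, train on the first part, report test accuracy."""
    docs = [clean(t) for t in texts]
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    train_docs = [docs[i] for i in train]
    train_labels = [labels[i] for i in train]
    vocab = top_stems(train_docs, n)  # vocabulary comes from training data only
    model = train_nb(train_docs, train_labels, vocab)
    correct = sum(predict(model, docs[i]) == labels[i] for i in test)
    return correct / len(test)
```

Calling `evaluate(texts, labels)` with a list of comment strings and a parallel list of "polar" / "non-polar" labels returns the test-set accuracy as a fraction. It assumes there are enough documents for the 30% test portion to be non-empty.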

divilian commented 3 years ago

It's in the "classifier" directory.