BrickGoat opened 1 year ago
I think the data was scraped in a fairly clean state already, and CountVectorizer's default tokenization handles the bag of words and drops escape sequences/invalid characters for us. I was able to generate a list of the top 25 features (words) from each dataset; maybe we can split it by deleted and non-deleted comments and use that in our final model selection.
Yeah, I think CountVectorizer handles the escape sequences and invalid characters. I did see partial URLs in the bigrams, though, so I'll try to regex those out.
Note: creating a param grid and pipeline for each technique might make things simpler during model selection. Also, Week 12 on asulearn covers each technique.
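For the pipeline idea, a minimal sketch of one technique wired into `GridSearchCV`. `LogisticRegression` is just a placeholder classifier, and the grids/data are illustrative; we'd build one pipeline + grid per technique we're comparing:

```python
# Sketch: CountVectorizer + classifier in a Pipeline, tuned via GridSearchCV.
# LogisticRegression and the grid values are placeholders for illustration.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-prefixed keys ("vec__", "clf__") route params to pipeline stages
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

# Tiny fake corpus: label 1 = deleted comment, 0 = kept (hypothetical)
docs = ["good comment stays", "bad spam removed now",
        "helpful comment stays here", "spam link removed again"] * 3
labels = [0, 1, 0, 1] * 3

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_)
```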