BrickGoat / reddit-comment-moderator

Webscraper and ML model for predicting whether a comment will be deleted by a subreddit moderator

Data Preprocessing #2

Open BrickGoat opened 1 year ago

BrickGoat commented 1 year ago

Note: Maybe creating a param grid and pipeline for each technique would make things simpler during model selection. Also Week 12 on asulearn covers each technique.
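A rough sketch of what a param grid plus pipeline per technique could look like, assuming scikit-learn. The classifier, the toy comments, and the specific grid values are placeholders for illustration, not what the project actually uses:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# One pipeline; the grid swaps out the "vect" step per technique.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
])

# One dict per preprocessing technique. GridSearchCV accepts a list of
# grids and searches each one separately.
param_grid = [
    {
        "vect": [CountVectorizer()],
        "vect__ngram_range": [(1, 1), (1, 2)],
    },
    {
        "vect": [TfidfVectorizer()],
        "vect__ngram_range": [(1, 1), (1, 2)],
        "vect__sublinear_tf": [True, False],
    },
]

# Toy stand-in for the scraped data (1 = deleted by a moderator).
comments = [
    "buy cheap followers at my site",
    "you are an idiot and so is your post",
    "click this link for free coins",
    "shut up nobody asked you",
    "thanks for sharing this article",
    "great explanation, learned a lot",
    "does anyone have a source for this",
    "i agree with the point about taxes",
]
deleted = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(comments, deleted)
print(search.best_params_)
```

Keeping every technique in one search means model selection is a single `fit` call, and `search.cv_results_` holds the scores for each technique side by side.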

kylekennedy26 commented 1 year ago

I think the data was scraped in a fairly clean state already. CountVectorizer, I think, automatically tokenizes the text into a bag of words and drops escape sequences/invalid characters. I was able to generate a list of the top 25 features (words) from each dataset; maybe we can split that by deleted and non-deleted comments and use it in our final model selection.

BrickGoat commented 1 year ago

Yeah, I think CountVectorizer handles the escape sequences and invalid characters. I did see partial URLs in the bigrams, so I'll try to regex those out.
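One possible way to regex the URLs out before vectorizing, using only the standard library. The patterns and the `strip_urls` name are guesses for illustration; the TLD list in particular is a short placeholder and would need tuning against the actual bigrams:

```python
import re

# Full URLs: anything starting with http://, https://, or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Partial URLs with no scheme, e.g. "imgur.com/abc". The TLD list is an
# assumption and only covers a few common cases.
DOMAIN_RE = re.compile(
    r"\b[\w-]+(?:\.[\w-]+)*\.(?:com|org|net|io|ly)\b\S*",
    re.IGNORECASE,
)


def strip_urls(text):
    """Remove full and partial URLs, then collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = DOMAIN_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(strip_urls("see https://example.com/a?b=1 for proof"))
print(strip_urls("posted on imgur.com/abc yesterday"))
```

Running this over the comments before they reach the vectorizer keeps URL fragments out of both the unigrams and the bigrams.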