BrickGoat opened 1 year ago
I think the data was scraped in a fairly clean state already, and CountVectorizer's default tokenization handles the bag of words and drops escape sequences/invalid characters for us. I was able to generate a list of the top 25 features (words) from each dataset; maybe we can split it by deleted and non-deleted comments and use that in our final model selection.
Yeah, I think CountVectorizer handles the escape sequences and invalid characters. I did see partial URLs in the bigrams, though, so I'll try to regex those out.
Note: creating a param grid and pipeline for each technique might make things simpler during model selection. Also, Week 12 on asulearn covers each technique.
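For the pipeline idea, a minimal sketch of one technique wired into `GridSearchCV`. `LogisticRegression` is just a placeholder classifier, and the grids/data are illustrative; we'd build one pipeline + grid per technique we're comparing:

```python
# Sketch: CountVectorizer + classifier in a Pipeline, tuned via GridSearchCV.
# LogisticRegression and the grid values are placeholders for illustration.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-prefixed keys ("vec__", "clf__") route params to pipeline stages
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

# Tiny fake corpus: label 1 = deleted comment, 0 = kept (hypothetical)
docs = ["good comment stays", "bad spam removed now",
        "helpful comment stays here", "spam link removed again"] * 3
labels = [0, 1, 0, 1] * 3

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_)
```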