EricPostMaster / Are-You-Irish-Classifier

Streamlit application that uses Naive Bayes to assign users an "Irish-ness score" (the Murphy Index). ☘ Development process has application to spam detection on short-length documents.
MIT License

CountVectorizer does not remove all special characters #6

Open EricPostMaster opened 2 years ago

EricPostMaster commented 2 years ago

The current dataset has been manually adjusted to remove special characters and accent marks on names. I think the underlying problem has something to do with the encoding of the text from Wikipedia (Python reports it as a Western European encoding).
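One possible way to replace the manual cleanup is to normalize each name before it reaches the vectorizer. The sketch below is not the repository's code: it assumes the "Western Europe" encoding is `cp1252` (a guess, not confirmed in this issue), and `normalize_name` is a hypothetical helper. It uses only the standard library to decompose accented characters, drop the combining marks, and replace any remaining special characters with spaces.

```python
import re
import unicodedata


def normalize_name(text: str) -> str:
    """Strip accent marks and special characters from a name (hypothetical helper)."""
    # NFKD splits an accented character into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not an ASCII letter, space, apostrophe, or hyphen.
    cleaned = re.sub(r"[^A-Za-z' -]", " ", no_accents)
    # Collapse repeated whitespace and lowercase the result.
    return re.sub(r"\s+", " ", cleaned).strip().lower()


# Assumption: the raw Wikipedia text is cp1252-encoded bytes.
raw = b"Se\xe1n \xd3 Briain"
decoded = raw.decode("cp1252")
normalized = normalize_name(decoded)
```

A function like this could be passed to `CountVectorizer` through its `preprocessor` parameter. The built-in `strip_accents="unicode"` option may also cover the accent marks, though it would not remove other special characters.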