Data4Democracy / assemble

NOT AN ACTIVE PROJECT -- Check readme for data sources
MIT License

Tweet text data parsing/cleaning for nlp #25

Open wwymak opened 7 years ago

wwymak commented 7 years ago

Some of the tasks we might do are:

Depending on what you want to achieve, you might not need all of the above. For example, when training word2vec you might not need any of it, though you might still want to convert emojis.
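To make the emoji point concrete, here is a minimal sketch of one way to turn emojis into word-like tokens before training embeddings. It uses only the standard library; the function name `emojis_to_words` and the `:name:` token format are made up for illustration, and treating every character in Unicode category "So" as an emoji is a rough heuristic (it also catches symbols such as © or °), not something agreed on in this thread.

```python
import unicodedata


def emojis_to_words(text):
    """Replace emoji-like characters with ':unicode_name:' tokens.

    Assumption: any character in Unicode category 'So' (Symbol, other)
    is treated as an emoji. This is a heuristic, not an exact emoji test.
    """
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                token = ":" + name.lower().replace(" ", "_").replace("-", "_") + ":"
                # Pad with spaces so the token separates from neighbouring words.
                pieces.append(" " + token + " ")
                continue
        elif ch == "\ufe0f":
            # Drop the emoji variation selector, which carries no meaning here.
            continue
        pieces.append(ch)
    # Collapse runs of whitespace (this also flattens newlines to spaces).
    return " ".join("".join(pieces).split())


print(emojis_to_words("I love this 😀"))  # I love this :grinning_face:
```

Whether to keep the emoji as a single token like this or drop it entirely probably depends on the downstream task.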

Useful libraries: spaCy, NLTK, sklearn, TextBlob, gensim, Mallet

I'm exploring what is possible/needed at the moment with @divya, but feel free to chip in with opinions and ideas, especially if you're an NLP expert :)

jss367 commented 7 years ago

I saw the "help wanted" tag on this so I built a notebook that inputs tweets, then tokenizes, removes stop words, and stems the tweets. It's called CleanText.ipynb if you want to take a look at it. I'd be happy to make changes or additions if you have any suggestions.
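For anyone skimming the thread, the three steps mentioned (tokenize, remove stop words, stem) could look roughly like the sketch below. This is not the contents of CleanText.ipynb; it is a standard-library-only illustration. The stop word set, the `crude_stem` suffix stripper, and the `clean_tweet` name are all hypothetical stand-ins, and in practice you would likely swap in a proper stop word list and a real stemmer from one of the libraries listed above.

```python
import re

# Tiny illustrative stop word set (assumption); real lists are much longer.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "to", "of", "and",
    "in", "on", "for", "it", "this", "that", "i", "you",
}

# Words, optionally prefixed with @ (mention) or # (hashtag).
TOKEN_RE = re.compile(r"[@#]?\w+")


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tweet(text):
    """Lowercase, drop URLs, tokenize, remove stop words, stem plain words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = TOKEN_RE.findall(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Leave mentions and hashtags unstemmed so they stay recognisable.
    return [t if t[0] in "@#" else crude_stem(t) for t in kept]


print(clean_tweet("Watching the debates tonight! #Election2016 https://t.co/abc"))
# ['watch', 'debat', 'tonight', '#election2016']
```

Keeping hashtags and mentions intact is a design choice made here for illustration; some analyses would rather strip the `#` and treat the tag as an ordinary word.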