Data4Democracy / assemble

NOT AN ACTIVE PROJECT -- Check readme for data sources
MIT License

Word2Vec models #26

Open wwymak opened 7 years ago

wwymak commented 7 years ago

Construct word2vec models from the tweets of specific groups of people (e.g. the far right) and compare them with models trained on the overall twitterverse (e.g. http://fredericgodin.com/papers/Named%20Entity%20Recognition%20for%20Twitter%20Microposts%20using%20Distributed%20Word%20Representations.pdf)
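One possible way to do the comparison, as a rough sketch: two separately trained word2vec models do not share a coordinate system, so comparing raw vectors across them is not meaningful, but comparing a word's nearest neighbours in each model is. The snippet below assumes each model has already been exported to a plain `dict` of word to vector; `nearest` and `neighbour_overlap` are hypothetical helper names, and the toy vectors are made up purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=3):
    """Return the k words closest to `word` within one embedding space."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def neighbour_overlap(word, model_a, model_b, k=3):
    """Jaccard overlap of a word's top-k neighbours in two models (0.0 to 1.0)."""
    a = set(nearest(word, model_a, k))
    b = set(nearest(word, model_b, k))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up toy vectors, standing in for a group model and a general model.
group_model = {
    "wall": [1.0, 0.0], "border": [0.9, 0.1], "security": [0.8, 0.2],
    "garden": [0.0, 1.0], "brick": [0.1, 0.9],
}
general_model = {
    "wall": [1.0, 0.0], "brick": [0.9, 0.1], "garden": [0.8, 0.2],
    "border": [0.0, 1.0], "security": [0.1, 0.9],
}
overlap = neighbour_overlap("wall", group_model, general_model, k=2)
```

A low overlap for a word would suggest the group uses it in a different context from Twitter at large, which is the kind of difference this issue is after.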

Some things to try:

- clustering tweets with tSNE/kMeans/PCA
- predict hashtags with tweet vectors
- do regression on tweet/hashtag vectors
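As a rough illustration of the clustering idea, the sketch below builds a tweet vector by averaging the word vectors of its tokens (one common choice, not something decided in this issue) and groups the results with a tiny k-means. Everything here is an assumption for illustration: `tweet_vector` and `kmeans` are hypothetical helpers, the centroids are seeded from the first `k` points to keep the run deterministic, and in practice a library implementation such as scikit-learn's `KMeans` or `TSNE` would likely be the better tool.

```python
def tweet_vector(tokens, word_vectors):
    """Average the vectors of the tokens we know; None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up 2-d word vectors and tokenised tweets, for illustration only.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
averaged = tweet_vector(["a", "b", "unknown"], toy_vectors)
labels, centroids = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```

The same tweet vectors could serve as features for the hashtag prediction and regression ideas in the list.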

Notes from a chat with a colleague of mine who did some NLP research. The following are some of his recommendations:

- word2vec is likely to give better results than e.g. CountVectorizer
- use word2vec with skip-gram training for the tweets themselves
- there is probably no need to remove stop words or tokenize tweets (but do remove punctuation)
- convert emojis into words, e.g. "happy", to get better context
- convert word2vec vectors into polar coordinates
- train word2vec for hashtags from tweets using CBOW

His opinion is that gensim is a handy tool, but he also built some extra utils for his own work that may be useful: https://github.com/pelodelfuego/word2vec-toolbox
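Two of these recommendations are mechanical enough to sketch. The first function strips punctuation and swaps emojis for words; the emoji table is a made-up two-entry example, and keeping `#` and `@`, lowercasing, and splitting on whitespace are my own assumptions rather than part of the advice. The second reads "polar coordinates" as standard hyperspherical coordinates (a radius plus n-1 angles), which is one interpretation and may not be what was meant.

```python
import math
import string

# Hypothetical emoji-to-word table; a real one would be much larger.
EMOJI_WORDS = {"\U0001F600": "happy", "\U0001F622": "sad"}

# Drop ASCII punctuation but keep '#' and '@' so hashtags and mentions survive.
PUNCTUATION_TO_DROP = string.punctuation.replace("#", "").replace("@", "")
DROP_TABLE = str.maketrans("", "", PUNCTUATION_TO_DROP)


def clean_tweet(text):
    """Replace known emojis with words, drop punctuation, split on whitespace."""
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, f" {word} ")
    return text.translate(DROP_TABLE).lower().split()


def to_polar(vector):
    """Convert an n-d vector (n >= 2) to (radius, [n-1 angles in radians])."""
    n = len(vector)
    if n < 2:
        raise ValueError("need at least two dimensions")
    radius = math.sqrt(sum(x * x for x in vector))
    angles = []
    # First n-2 angles lie in [0, pi]: angle between axis i and the remainder.
    for i in range(n - 2):
        tail = math.sqrt(sum(x * x for x in vector[i + 1:]))
        angles.append(math.atan2(tail, vector[i]))
    # Last angle lies in (-pi, pi] and fixes the sign of the final coordinate.
    angles.append(math.atan2(vector[-1], vector[-2]))
    return radius, angles


tokens = clean_tweet("Great day!!! \U0001F600 #MAGA")
radius, angles = to_polar([3.0, 4.0])
```

Note that `clean_tweet` does nothing special for URLs, so links would need separate handling before the output is fed to a word2vec trainer.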

I have been tinkering a bit with our data using gensim (it seems fairly easy to use, although I haven't yet looked at what falls out of it)

patrick-dd commented 7 years ago

Starting on this

hadoopjax commented 7 years ago

Great to hear, @patrick-dd, thanks for picking this up! I invited you to the D4D organization so you can be assigned the issue (helps us track who's working on what).

wwymak commented 7 years ago

Looks like @patrick-dd and I are going to work from the two different ends of the problem and, with luck, meet in the middle :) Just thought I'd add that anyone else who is interested is welcome, since it'll be useful to get different insights into this task.