Data4Democracy / discursive

Twitter topic search and indexing with Elasticsearch

Build core NLP capability for analyzing Tweets #17

Closed · hadoopjax closed this 7 years ago

hadoopjax commented 7 years ago

This is a placeholder for the design (and associated discussion) of our foundational NLP capability for analyzing collected Tweets. Divya (@divya on Slack) and Wendy (@wwymak on Slack) will be taking the lead, with support from anyone/everyone else who wants to help! The goal is to publish a proposed design for the implementation to this issue later this week and get feedback from the community. If anyone else is interested in participating, please don't hesitate to contact them.

divyanair91 commented 7 years ago

Hey! Excited to get started! I'd pinged @hadoopjax a bit about some ideas. Here are some initial resources/ideas.

Using this paper as a starting point: https://homes.cs.washington.edu/~mausam/papers/emnlp11.pdf

  1. They make some great points that off-the-shelf parsers are really not going to cut it for social data, because the structure of social data, and tweets in particular, is very distinctive. They also refer over and over to their annotated data sets as the key to their success (see this link: https://github.com/aritter/twitter_nlp/tree/master/data/annotated). I think a first step will be to take a basic look at their training set vs ours using tools like NLTK/spaCy and see whether we trust an externally trained POS tagger or should build our own (a quick sketch of that comparison follows this list).

  2. I'm mainly familiar with NLTK myself, but I'm open to learning new tools like spaCy so we're on common ground. The paper you shared with me also mentions MALLET (which I know another team at my company is using to great success); it's a Java package, but Python wrappers are available, and it uses LDA to find naturally occurring topics. This could be a great second step, especially since we're looking at making some really sensitive claims about groups of people: instead of making our own assumptions about grouping, let the algorithm group the conversation for us and we can see what we make of it from there.
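A minimal sketch of that off-the-shelf tagger comparison (assuming NLTK and spaCy are installed with their default English models downloaded; the sample tweet is made up):

```python
import nltk
import spacy

# Hypothetical sample tweet, just for illustration
tweet = "@someuser lol that article is sooo wrong #fakenews http://t.co/abc123"

# NLTK: word-tokenize, then tag with the default (newswire-trained) tagger
nltk_tags = nltk.pos_tag(nltk.word_tokenize(tweet))

# spaCy: tokenization and tagging happen in one pipeline call
nlp = spacy.load("en_core_web_sm")
spacy_tags = [(tok.text, tok.tag_) for tok in nlp(tweet)]

print(nltk_tags)
print(spacy_tags)
# Comparing these outputs against the annotated Twitter data linked above should
# show where newswire-trained taggers stumble on mentions, hashtags, and URLs.
```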

wwymak commented 7 years ago

Two other interesting papers around Twitter NLP:

http://fredericgodin.com/papers/Named%20Entity%20Recognition%20for%20Twitter%20Microposts%20using%20Distributed%20Word%20Representations.pdf

http://fredericgodin.com/papers/Alleviating%20Manual%20Feature%20Engineering%20for%20Part-of-Speech%20Tagging%20of%20Twitter%20Microposts%20using%20Distributed%20Word%20Representations.pdf

The first paper comes with a gensim word2vec model trained on tweets, which may come in handy for comparing 'normal' tweets with the tweets we are targeting.
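A rough sketch of how a pretrained tweet embedding like that could be loaded and poked at with gensim; the file name and query words below are placeholders, not actual project artifacts:

```python
from gensim.models import KeyedVectors

# Placeholder path: substitute whichever pretrained tweet embedding we end up using
model = KeyedVectors.load_word2vec_format("twitter_word2vec.bin", binary=True)

# Nearest neighbours give a quick sanity check on what the embedding has learned
print(model.most_similar("protest", topn=10))

# Pairwise cosine similarity can help flag vocabulary differences between corpora
print(model.similarity("refugee", "immigrant"))
```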

divyanair91 commented 7 years ago

@wwymak and I got together today and came up with some jumping-off points for anyone looking to get involved in this project. The first thing we need to do is refine our corpus into an easy-to-use format and lay down the exploratory groundwork steps detailed below, so that we understand what we're working with and have the information to decide how to move forward.

We volunteer to tackle steps 1 and 2 and will keep posting updates and new issues here.

  1. Look through the data available at https://data.world/data4democracy/far-right as well as the awesome new data we have streaming in, courtesy of @sjacks26, @bstarling, and the other awesome folks working on that. See the following link if you want to figure out how to plug into it: https://github.com/Data4Democracy/discursive/blob/master/README.md

    • The main goal here is to get a clean set of comments, but we'll need to do some research before we define what 'clean' means (for example: do we want retweets included or only original comments? My leaning is only original comments).
  2. We need to get the comments up to snuff for NLP analysis and make a corpus that's easy to plug into all the following steps. This means we need to:

    • Stem
    • Tokenize
    • Remove stop words -- List of stop words: https://pypi.python.org/pypi/stop-words
    • Tag parts of speech (POS). Note: this seems simple because of all the great libraries available, but every data source/set of comments is a little bit different, so we don't always want to just plug and play -- remain skeptical! (A minimal preprocessing sketch follows the library list below.)
  3. TF-IDF (term frequency-inverse document frequency). The basic idea here is that the rarer a word is across the corpus, the more indicative it is of a specific topic, document, etc., and we want to figure out what those important words are. Examples: (a sketch follows the library list below)

  4. t-SNE (t-distributed stochastic neighbor embedding). It's often easier to spot patterns when we can see things, so let's pare down the dimensionality of our big token sets and make some pretty 2D or 3D plots. Examples: (a sketch follows the library list below)

  5. LDA (latent Dirichlet allocation). Let's figure out what topics naturally occur in our data to help broaden our understanding of this group and what they care about. Examples: (a sketch follows the library list below)

Useful libraries: spaCy, NLTK, scikit-learn, TextBlob, gensim, MALLET
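A minimal preprocessing sketch for step 2 using NLTK; the tokenizer settings, stemmer, and stop-word list are only starting assumptions, and the sample tweets are made up:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Hypothetical input; in practice this would come from the Elasticsearch index
raw_tweets = [
    "RT @someuser: this is sooo typical #politics http://t.co/abc123",
    "Can't believe the news today... #breaking",
]

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet_text):
    """Tokenize a raw tweet, drop English stop words, and stem what's left."""
    tokens = tokenizer.tokenize(tweet_text)
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]

corpus = [preprocess(t) for t in raw_tweets]
print(corpus)
```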
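A TF-IDF sketch for step 3 with scikit-learn, assuming `corpus` is the list of token lists produced by the preprocessing sketch above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Join the token lists back into strings, since TfidfVectorizer expects raw documents
docs = [" ".join(tokens) for tokens in corpus]

vectorizer = TfidfVectorizer(max_features=5000)
tfidf = vectorizer.fit_transform(docs)  # sparse matrix of shape (n_tweets, n_terms)

# Top-weighted terms in the first tweet: a rough view of what makes it distinctive
terms = np.array(vectorizer.get_feature_names_out())
weights = tfidf[0].toarray().ravel()
print(terms[weights.argsort()[::-1][:10]])
```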
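A t-SNE sketch for step 4, reducing the TF-IDF vectors from the previous sketch to two dimensions for plotting; it assumes a reasonably large corpus, and the SVD step and perplexity value are just common defaults:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Compress the sparse TF-IDF matrix first so t-SNE stays tractable on many tweets
reduced = TruncatedSVD(n_components=50, random_state=42).fit_transform(tfidf)
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(reduced)

plt.scatter(coords[:, 0], coords[:, 1], s=5, alpha=0.5)
plt.title("t-SNE of tweet TF-IDF vectors")
plt.show()
```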
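An LDA sketch for step 5 using gensim (MALLET could be swapped in via a wrapper); `corpus` is the list of token lists from the preprocessing sketch, and `num_topics=10` is only a starting guess:

```python
from gensim import corpora
from gensim.models import LdaModel

# Map tokens to ids and convert each tweet to a bag-of-words representation
dictionary = corpora.Dictionary(corpus)
bow = [dictionary.doc2bow(tokens) for tokens in corpus]

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=10, passes=10, random_state=42)

# Print the highest-probability words for each discovered topic
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```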

wwymak commented 7 years ago

So I had a chat with a colleague of mine who did some NLP research. The following are some of his recommendations:

His opinion is that gensim is a handy tool, but he also built some extra utilities for his work that may be useful: https://github.com/pelodelfuego/word2vec-toolbox. (I also have the full data files for models trained on the whole Wikipedia corpus, although I'm not sure whether that will be useful here; the data files are 19 GB.)

bstarling commented 7 years ago

Throwing this out there, but there is some pretty sweet regex in this code to tokenize a tweet. It does some basics to capture emojis, HTML tags, and @mentions: https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
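A sketch in the spirit of that tokenizer (not the blog's exact code): a handful of regexes for emoticons, HTML tags, @mentions, hashtags, and URLs, with plain words as the fallback:

```python
import re

patterns = [
    r"(?:[:;=8][\-o\*']?[\)\]\(\[dDpP/\\:\}\{@\|])",  # emoticons like :-) or ;P
    r"<[^>]+>",                                        # HTML tags
    r"(?:@[\w_]+)",                                    # @mentions
    r"(?:\#+[\w_]+[\w'_\-]*[\w_]+)",                   # hashtags
    r"https?://\S+",                                   # URLs
    r"(?:[\w_]+)",                                     # plain words
]
tokens_re = re.compile("|".join(patterns), re.IGNORECASE)

def tokenize(tweet):
    """Split a tweet into tokens, keeping mentions, hashtags, and URLs intact."""
    return tokens_re.findall(tweet)

print(tokenize("RT @user: check this out :) #nlp https://example.com"))
```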

hadoopjax commented 7 years ago

Closing this - moved to Assemble