Data4Democracy / discursive

Twitter topic search and indexing with Elasticsearch

discursive

This tool searches Twitter for a collection of topics and stores the Tweet data in an Elasticsearch index and an S3 bucket. The intended use case is for social network composition and Tweet text analysis.
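The flow described above can be sketched in a few lines. This is a minimal illustration, not discursive's actual code: the document field names and the `tweet_to_doc` helper are assumptions, and the pipeline calls at the bottom are shown only as comments because they require live Twitter, Elasticsearch, and S3 credentials.

```python
import json


def tweet_to_doc(status):
    """Flatten a Twitter API status dict into a document for Elasticsearch.

    The fields chosen here (text, timestamp, author, follower count) are
    illustrative; the real index schema lives in this repo's config.
    """
    user = status.get("user", {})
    return {
        "text": status.get("text"),
        "created_at": status.get("created_at"),
        "screen_name": user.get("screen_name"),
        "followers_count": user.get("followers_count"),
    }


# The full pipeline would look roughly like this (hypothetical calls,
# using tweepy, elasticsearch-py, and boto3 respectively):
#
#   statuses = api.search(q="your topic")                      # fetch Tweets
#   docs = [tweet_to_doc(s._json) for s in statuses]           # flatten
#   helpers.bulk(es, docs, index="twitter")                    # index into ES
#   s3.put_object(Bucket="...", Key="raw.json",
#                 Body=json.dumps([s._json for s in statuses]))  # archive to S3

if __name__ == "__main__":
    sample = {"text": "hello", "created_at": "Mon Jan 01 00:00:00 +0000 2017",
              "user": {"screen_name": "nick", "followers_count": 42}}
    print(json.dumps(tweet_to_doc(sample)))
```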

Setup

Everything you see here runs on AWS EC2 and the AWS Elasticsearch service. Currently, it runs just fine in the free tier. Things you will need include:

This sounds like a lot, but was quite quick to cobble together.

Scouring the Twitterverse

Once you have cloned the repo, you're ready to rock:

  1. Install Docker on your EC2 instance using instructions appropriate for your OS (the code in this repo is run using Ubuntu).

  2. Change into the Discursive directory (i.e. cd discursive/).

  3. Run essetup.py, located in the /config directory; this generates the Elasticsearch index with the appropriate mappings.

  4. Update the aws_config.py, twitter_config.py, esconn.py, and s3conn.py files located in the /config directory with your credentials.

  5. Put your desired keyword(s) in the topics.txt file (one term per line).

  6. Edit the crontab file to run at your desired intervals. The default will run every fifteen minutes.

  7. Run sudo docker build -t discursive .

  8. Run sudo docker run discursive

  9. If all went well, you're watching Tweets stream into your Elasticsearch index! Alternatively, run index_twitter_search.py to search for specific topic(s) and bulk insert the data into your Elasticsearch index (you'll see the messages Elasticsearch returns in your console).

  10. There are several options you may want to configure or tweak. For instance, you may want to turn off printing to the console (which you can do in index_twitter_search.py) or run the container as a detached process. Please do jump into our Slack channel #assemble if you have any questions, or log an issue!
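The fifteen-minute default in step 6 corresponds to a crontab entry along these lines. The working directory and script name here are assumptions for illustration; check the crontab file in the repo for the actual entry.

```shell
# Run the topic search every fifteen minutes (paths are illustrative)
*/15 * * * * cd /home/ubuntu/discursive && python index_twitter_search.py
```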
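Step 3's essetup.py creates the index before any Tweets are inserted. A standard-library-only sketch of what that amounts to is below; the index name, field names, and mapping are assumptions for illustration, and the real mappings live in /config/essetup.py. The `create_index` call is left commented out since it needs a reachable Elasticsearch endpoint.

```python
import json
import urllib.request

# Assumed index mapping -- the real one is defined in essetup.py.
INDEX_MAPPING = {
    "mappings": {
        "tweet": {
            "properties": {
                "text": {"type": "string"},
                "created_at": {"type": "date"},
                "followers_count": {"type": "integer"},
            }
        }
    }
}


def create_index(es_url, index="twitter"):
    """PUT the mapping to the Elasticsearch endpoint to create the index."""
    req = urllib.request.Request(
        "%s/%s" % (es_url.rstrip("/"), index),
        data=json.dumps(INDEX_MAPPING).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)


# create_index("https://your-aws-es-endpoint")  # requires a live cluster
```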

Explore Twitter networks

A warning: this is experimental, so check back often for updates. There are four important files for exploring the network of Tweets we've collected:

So, with some additional munging, you can use the above to build a graph of users, their followers, and their friends. Combined with the additional data we collect (tweet text, retweets, follower counts, etc.), this represents the beginning of our effort to enable analysts by providing curated, network-analysis-ready data!
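The munging step above can be sketched with nothing but the standard library. The record shape here is an assumption (who-follows-whom pairs pulled from the index); from such records you can build an adjacency map and count in-degrees, which is the usual starting point for network analysis.

```python
from collections import defaultdict

# Hypothetical follower records pulled from the index; the field names
# are assumptions for illustration.
records = [
    {"user": "alice", "follower": "bob"},
    {"user": "alice", "follower": "carol"},
    {"user": "bob", "follower": "carol"},
]

# Build a directed adjacency map: follower -> set of users they follow.
follows = defaultdict(set)
for r in records:
    follows[r["follower"]].add(r["user"])

# In-degree (follower count) per user -- a first, simple network metric.
in_degree = defaultdict(int)
for targets in follows.values():
    for user in targets:
        in_degree[user] += 1

print(dict(in_degree))  # alice has two followers, bob has one
```

From this same structure it is a short step to handing the edges to a graph library for centrality or community detection.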

Where to find help

There is a chance setting all this up will give you problems. How great a chance? I don't care to speculate publicly. I'm @nick on our Slack, or you can file an issue here (please, for my sanity, just join us on Slack and let's talk there).

Want to use our infra?

I am a-ok with sharing access to the running instance of Elasticsearch until we get new infra up. I am even happy to take your search term requests and type them into my functioning configuration of this thing and have them indexed if you want to send them to me. I will do this for free because we're fam. Just ping me.

Current Work & Roadmap

Working with Docker