Data4Democracy / discursive

Twitter topic search and indexing with Elasticsearch

Build an analytics layer atop the Twitter data #3

Closed: hadoopjax closed this issue 7 years ago

hadoopjax commented 7 years ago

Ultimately, we'll want to build analytical products using the Twitter data as a source. To do that, we'll need to identify distinct user handles and capture referenced URLs and common hashtags. We'll also look to use some NLP techniques to dig deeper into the tweet text and profile descriptions.
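As a starting point for the handles, URLs and hashtags mentioned above, here is a minimal sketch. It assumes each stored tweet follows the standard Twitter API v1.1 tweet object (`user.screen_name`, `entities.hashtags[].text`, `entities.urls[].expanded_url`); the function name `summarize_tweets` and the sample records are made up for illustration and are not part of the project.

```python
from collections import Counter


def summarize_tweets(tweets):
    """Collect distinct handles plus hashtag and URL counts from tweet dicts.

    Assumes the Twitter API v1.1 tweet layout; missing fields are skipped.
    """
    handles = set()
    hashtags = Counter()
    urls = Counter()
    for tweet in tweets:
        user = tweet.get("user") or {}
        if user.get("screen_name"):
            handles.add(user["screen_name"])
        entities = tweet.get("entities") or {}
        for tag in entities.get("hashtags", []):
            # Hashtags are case-insensitive on Twitter, so count them lowercased.
            hashtags[tag["text"].lower()] += 1
        for url in entities.get("urls", []):
            # Prefer the expanded URL over the t.co short link when present.
            link = url.get("expanded_url") or url.get("url")
            if link:
                urls[link] += 1
    return handles, hashtags, urls


# Hypothetical sample records, shaped like v1.1 tweets.
sample = [
    {"user": {"screen_name": "alice"},
     "entities": {"hashtags": [{"text": "OpenData"}],
                  "urls": [{"url": "https://t.co/abc",
                            "expanded_url": "https://example.org/report"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"hashtags": [{"text": "opendata"}, {"text": "civictech"}],
                  "urls": []}},
    {"user": {"screen_name": "alice"},
     "entities": {"hashtags": [], "urls": []}},
]

handles, hashtags, urls = summarize_tweets(sample)
print(sorted(handles))          # ['alice', 'bob']
print(hashtags.most_common(1))  # [('opendata', 2)]
```

Returning `Counter` objects keeps the follow-up questions ("what are the ten most common hashtags?") to a single `most_common(10)` call.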

The Twitter data is stored in S3 as JSON, so pulling it into an analytical environment (maybe a Jupyter notebook?) could be a great task for beginners to tackle: there's plenty of documentation on the interwebs describing how to analyze JSON data sources with Python!
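For anyone picking this up in a notebook, a minimal sketch of the parsing step might look like the following. The thread does not say whether each S3 object holds one JSON array or one JSON object per line, so this hypothetical `parse_tweets` helper accepts either; it uses only the standard library.

```python
import json


def parse_tweets(text):
    """Parse a JSON payload into a list of tweet dicts.

    Accepts a JSON array, a single JSON object, or newline-delimited JSON
    (one object per line). Which of these the bucket uses is an assumption.
    """
    text = text.strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document, so fall back to one object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


# Hypothetical payloads in both layouts.
array_payload = '[{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]'
lines_payload = '{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n'

print(len(parse_tweets(array_payload)))  # 2
print(len(parse_tweets(lines_payload)))  # 2
```

From there, the resulting list of dicts can be explored directly or handed to a dataframe library of your choice.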

jtsmith2 commented 7 years ago

I could work on this. Are you just looking for someone to create a notebook that pulls the data, so that someone else could start right away without having to know how to pull it, or are you looking for some actual analysis examples?

hadoopjax commented 7 years ago

So we have this thing that isn't terribly fancy. I think it would be amazing to build it out to include anything you think would be worthwhile, putting your "I'm an analyst new to this kind of thing" hat on (e.g. how to extract only screen_names from the data, or how to get a list of the locations, if they are included in the response).
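The two beginner examples named here could be sketched roughly as below. This assumes the Twitter API v1.1 layout, where `user.screen_name` is the handle and `user.location` is an optional free-text profile field that may be empty or null; the helper names and sample data are hypothetical.

```python
from collections import Counter


def screen_names(tweets):
    """Return the screen_name of each tweet's author, in tweet order."""
    names = []
    for tweet in tweets:
        name = (tweet.get("user") or {}).get("screen_name")
        if name:
            names.append(name)
    return names


def location_counts(tweets):
    """Count self-reported profile locations, skipping empty or missing ones."""
    counts = Counter()
    for tweet in tweets:
        location = ((tweet.get("user") or {}).get("location") or "").strip()
        if location:
            counts[location] += 1
    return counts


# Hypothetical sample records, shaped like v1.1 tweets.
tweets = [
    {"user": {"screen_name": "alice", "location": "Austin, TX"}},
    {"user": {"screen_name": "bob", "location": None}},
    {"user": {"screen_name": "carol", "location": "Austin, TX"}},
    {"user": {"screen_name": "dave", "location": "  "}},
]

print(screen_names(tweets))     # ['alice', 'bob', 'carol', 'dave']
print(location_counts(tweets))  # Counter({'Austin, TX': 2})
```

Because the location field is free text typed by the user, values such as "Austin" and "Austin, TX" will be counted separately unless they are normalized first.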

hadoopjax commented 7 years ago

Thoughts on closing this "epic" and breaking it into smaller chunks? a la https://github.com/Data4Democracy/discursive/issues/17?

alejandrox1 commented 7 years ago

Something like this or this?

thrastarson commented 7 years ago

Seems like it would be useful to create a simple wrapper around the boto library for the S3 bucket. It could have functions like list_tweet_collections(), get_collection('tweets-25'), etc. That is, if the idea is to store tweets in AWS.
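A rough sketch of such a wrapper is below. It uses boto3 (the current AWS SDK for Python) rather than the older boto package named above, and it assumes each collection is a single S3 object containing one JSON document. The class name `TweetStore`, the bucket name, and the key prefix are all hypothetical; only `list_objects_v2` and `get_object` are real boto3 client calls.

```python
import json


class TweetStore:
    """Thin wrapper around an S3 bucket that holds tweet collections as JSON."""

    def __init__(self, bucket, prefix="", client=None):
        if client is None:
            # Imported lazily so a stub client can be injected without boto3.
            import boto3
            client = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix
        self.client = client

    def list_tweet_collections(self):
        """Return the names of all objects under the prefix, prefix removed."""
        names = []
        kwargs = {"Bucket": self.bucket, "Prefix": self.prefix}
        while True:
            page = self.client.list_objects_v2(**kwargs)
            for obj in page.get("Contents", []):
                names.append(obj["Key"][len(self.prefix):])
            if not page.get("IsTruncated"):
                return names
            # Each response is capped in size, so keep following the token.
            kwargs["ContinuationToken"] = page["NextContinuationToken"]

    def get_collection(self, name):
        """Download one collection and parse it as a JSON document."""
        response = self.client.get_object(
            Bucket=self.bucket, Key=self.prefix + name
        )
        return json.loads(response["Body"].read())
```

With AWS credentials configured, usage would look something like `store = TweetStore("my-tweet-bucket", prefix="tweets/")` followed by `store.get_collection("tweets-25")`, where the bucket name and prefix are placeholders. Accepting an optional `client` argument keeps the wrapper testable without network access.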

alejandrox1 commented 7 years ago

I'll give it a try.

hadoopjax commented 7 years ago

@alejandrox1 I think you covered this well here, so I'm going to close this one.