Twitter makes it hard to get all of a user's tweets (assuming they have more than 3200). This is a way to get around that using Python, Selenium, and Tweepy.
Essentially, we will use Selenium to open up a browser and automatically visit Twitter's search page, searching for a single user's tweets on a single day. If we want all tweets from 2015, we will check all 365 days / pages. This would be a nightmare to do manually, so the `scrape.py` script does it all for you - all you have to do is input a date range and a twitter user handle, and wait for it to finish.
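The day-by-day loop can be sketched like this. This is a rough illustration, not the actual code in `scrape.py` — the search URL format and the `search_urls` helper are assumptions:

```python
from datetime import date, timedelta

def search_urls(user, start, end):
    """Yield one Twitter search URL per day in [start, end)."""
    day = start
    while day < end:
        nxt = day + timedelta(days=1)
        # Twitter search supports from:/since:/until: operators;
        # %3A is a URL-encoded colon, %20 an encoded space.
        yield ("https://twitter.com/search?f=tweets&q="
               f"from%3A{user}%20since%3A{day}%20until%3A{nxt}")
        day = nxt

# One page per day of 2015 -- the scraper would open each URL in a
# Selenium-driven browser, scroll to load all results, and collect ids.
urls = list(search_urls("someuser", date(2015, 1, 1), date(2016, 1, 1)))
print(len(urls))  # 365
```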
The `scrape.py` script collects tweet ids. If you know a tweet's id number, you can get all the information available about that tweet using Tweepy - text, timestamp, number of retweets / replies / favorites, geolocation, etc. Tweepy uses Twitter's API, so you will need to get API keys. Once you have them, you can run the `get_metadata.py` script.
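Conceptually, the metadata step boils down to looking up the collected ids in batches. A minimal sketch, assuming an authenticated `tweepy.API` object is passed in as `api` — the 100-id cap matches Twitter's statuses/lookup endpoint, but the helper names here are illustrative, not necessarily what `get_metadata.py` uses:

```python
def batches(ids, size=100):
    """Twitter's statuses/lookup endpoint accepts at most 100 ids per call."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def hydrate(api, tweet_ids):
    """Fetch full tweet objects for a list of tweet ids.

    api is an authenticated tweepy.API instance; lookup_statuses is
    named statuses_lookup in tweepy versions before 4.0.
    """
    tweets = []
    for batch in batches(tweet_ids):
        tweets.extend(api.lookup_statuses(batch))
    return tweets
```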
You'll need:

- python3
- pip (or pip3)
- `pip3 install selenium`
- `pip3 install tweepy`
To run the scraper:

- Open up `scrape.py` and edit the `user`, `start`, and `end` variables (and save the file).
- Run `python3 scrape.py`.
- When it finishes, it outputs the tweet ids it found into `all_ids.json`.
- Getting a "no such file" error? You need to cd to the directory of `scrape.py`.
- To use a different browser, open up `scrape.py` and change the driver to use `Chrome()` or `Firefox()`.
- Having trouble getting all the tweets for a single day? Open up `scrape.py` and change the `delay` variable to 2 or 3.
- Fill in your API keys in the `sample_api_keys.json` file.
- Rename `sample_api_keys.json` to `api_keys.json`.
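The renamed file can then be read at startup. A sketch, assuming the sample file uses the four standard Twitter credential fields — check `sample_api_keys.json` for the real field names:

```python
import json

# Assumed field names; the repo's sample file is the source of truth.
REQUIRED = ("consumer_key", "consumer_secret",
            "access_token", "access_token_secret")

def load_keys(path="api_keys.json"):
    """Load API credentials and fail loudly if any field is missing."""
    with open(path) as f:
        keys = json.load(f)
    missing = [k for k in REQUIRED if k not in keys]
    if missing:
        raise KeyError(f"{path} is missing fields: {missing}")
    return keys
```

The loaded keys would typically be handed to `tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])` followed by `set_access_token(...)`.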
- Open up `get_metadata.py` and edit the `user` variable (and save the file).
- Run `python3 get_metadata.py`. It gets metadata for every tweet id in `all_ids.json` and outputs:
  - `username.json` (master file with all metadata)
  - `username.zip` (a zipped file of the master file with all metadata)
  - `username_short.json` (smaller master file with relevant metadata fields)
  - `username.csv` (csv version of the smaller master file)
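For a sense of how the short JSON relates to the CSV, here is a sketch of the conversion — the record fields are assumptions, and the real field set is whatever `username_short.json` contains:

```python
import csv
import json

def short_json_to_csv(json_path, csv_path):
    """Write each flat record in the short JSON file as one CSV row."""
    with open(json_path) as f:
        rows = json.load(f)  # assumed: a list of flat dicts, one per tweet
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```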