energyapps/social-scraper

About

The purpose of these scripts is to collect and organize data showing the size of the Department of Energy's social media audiences by scraping follower data from Twitter, Instagram, YouTube, and (if run with a temporary API key) Facebook. They can be expanded to include other platforms in the future.

There are two scripts that are housed here. Both of them can be run locally to see how they work. They are:

Org Chart Data
- Collects the data for the DOE Social Media Org Chart
- Ideally collects data once a day or once a week. NOTE: it does not need to collect data hourly.
- Resulting data should look like this.
Hourly Follower Count
- Collects the data for the charts found at https://energyapps.github.io/social/followers and the matrix pages.
- Collects data hourly in order to track the growth of audience over time.
- Resulting data should look like this.
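
As a rough illustration of the hourly collection described above, the following sketch appends one timestamped reading per run to a CSV file. The column layout, the function name `append_count`, and the example account name are all assumptions made for illustration; the files the real scripts produce may be organized differently.

```python
import csv
import os
import time

# Hypothetical column layout; the real output files may differ.
FIELDS = ["timestamp", "platform", "account", "followers"]


def append_count(path, platform, account, followers, now=None):
    """Append one follower reading to a CSV file.

    A header row is written first if the file is new or empty.
    `now` is seconds since the epoch; None means the current time.
    """
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": stamp,
            "platform": platform,
            "account": account,
            "followers": followers,
        })
```

Called once per hour (for example from a scheduled job) with something like `append_count("hourly.csv", "twitter", "example_account", 12345)`, this builds up one row per reading, which is enough to chart growth over time.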

**The goal of this repo is to get each of these scripts running as a Jenkins job, with the output served at https://energy.gov/api/social-media/ and read access allowed via CORS rules for energyapps.github.io/social.**

Dependencies

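Some of the Python scripts make use of Beautiful Soup. The other packages they use are csv, requests, json, and time. Of these, csv, json, and time are part of the Python standard library; Beautiful Soup and requests must be installed separately.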
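
Before running the scripts on a new machine, it can help to confirm that these packages are importable. The sketch below is one way to do that. It assumes Beautiful Soup version 4, which is imported as `bs4`; if the scripts use a different version, adjust the name. The function name `missing_modules` is made up for this example.

```python
import importlib.util

# Import names, not pip package names. "bs4" assumes Beautiful Soup 4.
REQUIRED = ["bs4", "requests", "csv", "json", "time"]


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```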

Known Problems

Facebook scraping

While it is possible to collect Facebook data through the API that Facebook provides via its developer tools, doing so is more complicated than scraping a public-facing website. An API key is required and, from what I could tell, it expires frequently. There is probably a developer tool that allows a "set it and forget it" approach, but I had not found one by the time I left the job.

This is why we do not track hourly Facebook data in Hourly Follower Count. Facebook collection has also been turned off for the DOE Social Media Org Chart, but it can be reinstated whenever someone works out a longer-lasting option.

Error Handling

At this time there is no elegant solution for noticing when things are broken. There are a few fail-safes built in, but they could be much improved.
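
One inexpensive improvement would be to sanity-check each scraped number before saving it. The sketch below is not taken from the existing scripts. The function name `check_count` and the 50% drop threshold are assumptions chosen only for illustration.

```python
def check_count(account, new, previous=None, max_drop=0.5):
    """Return a list of problems found with a scraped follower count.

    An empty list means the value looks plausible.
    `max_drop` is the largest fractional decrease accepted between readings.
    """
    problems = []
    # A failed scrape often yields None, a string, or a negative number.
    if not isinstance(new, int) or isinstance(new, bool) or new < 0:
        problems.append(f"{account}: scraped value {new!r} is not a non-negative integer")
        return problems
    if new == 0:
        problems.append(f"{account}: zero followers, the page layout may have changed")
    if previous and new < previous * (1 - max_drop):
        problems.append(f"{account}: count fell from {previous} to {new}")
    return problems
```

A script could log or email the returned messages and skip writing the suspect value, so that a broken scrape shows up as a warning instead of as bad data in the charts.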

[This ticket outlines what the needs are for scraping facebook with the API key]().

To Do List

Write tickets for folks.
Add Secretary's instagram account (@secretaryperry)
Add Energy Press Sec Twitter (@EnergyPressSec)
Install all Python packages on the script server.
Install and test both scripts on the script server.
Ensure that Ernie and Atiq are receiving regular updates that these are working.
Find a way to make Facebook numbers update automatically without having to manually insert a temporary API key into the script.
Ensure that energy.gov/api/social-media allows energyapps.github.io/social via favorable CORS rules.
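
For the CORS item above, note that a browser compares the `Access-Control-Allow-Origin` response header against the page's origin, which is the scheme and host only, so the `/social` path is not part of the value. The sketch below checks a set of response headers against that origin. The function name `cors_allows` is hypothetical, and the `https` scheme is an assumption.

```python
def cors_allows(headers, origin="https://energyapps.github.io"):
    """Return True if the response headers permit reads from `origin`.

    `headers` is any mapping of header names to values. Header names are
    compared case-insensitively, as HTTP requires.
    """
    allowed = None
    for name, value in headers.items():
        if name.lower() == "access-control-allow-origin":
            allowed = value.strip()
    # "*" permits any origin; otherwise the value must match exactly.
    return allowed in ("*", origin)
```

The headers could come from a test request made with the requests package, for example `requests.get(url, headers={"Origin": "https://energyapps.github.io"}).headers`, where `url` is one of the served data files.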

Wish List

Make Hourly Follower Count charts explorable to focus on specific time frames.
Fix Matrix diagrams
Figure out a way to have backup files.
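
One simple approach to backups, sketched here under assumptions, is to copy each data file to a timestamped name before the script overwrites or appends to it. The function name `backup_file` and the naming scheme are illustrative only.

```python
import os
import shutil
import time


def backup_file(path, backup_dir, now=None):
    """Copy `path` into `backup_dir` under a timestamped name.

    Returns the path of the copy. `now` is seconds since the epoch;
    None means the current time.
    """
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    base, ext = os.path.splitext(os.path.basename(path))
    dest = os.path.join(backup_dir, f"{base}-{stamp}{ext}")
    shutil.copy2(path, dest)  # copy2 also preserves file modification times
    return dest
```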