energyapps / social-scraper

Scraper scripts that collect social media information for energyapps.github.io/social

About

These scripts collect and organize data showing the size of the Department of Energy's various social media audiences. They scrape follower counts from Twitter, Instagram, YouTube, and (if run with a temporary API key) Facebook. They can be expanded to include other platforms in the future.

Two scripts are housed here. Both can be run locally to see how they work:

  1. Org Chart Data
    • Collects the data for the DOE Social Media Org Chart
    • Ideally collects data once a day or once a week. NOTE: it does not need to collect data hourly.
    • Resulting data should look like this.
  2. Hourly Follower Count

The goal of this repo is to get each of these scripts running as a Jenkins job and served at https://energy.gov/api/social-media/, with read access allowed via CORS rules to energyapps.github.io/social.
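As a rough illustration of the CORS rule described above, here is a minimal Python sketch of the decision a server would make for each request. The function name `cors_headers` and the constant `ALLOWED_ORIGINS` are hypothetical; how energy.gov is actually configured is not covered in this README. Note that a browser origin is scheme plus host, so the `/social` path is not part of what gets matched.

```python
from typing import Dict, Optional

# Assumption: the only site that needs read access to the data.
# An origin is scheme + host (+ port); "/social" is a path, not part of it.
ALLOWED_ORIGINS = {"https://energyapps.github.io"}


def cors_headers(request_origin: Optional[str]) -> Dict[str, str]:
    """Return the CORS response headers for a read-only data endpoint."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            # Echo the allowed origin back so the browser permits the read.
            "Access-Control-Allow-Origin": request_origin,
            # Tell caches that the response depends on the Origin header.
            "Vary": "Origin",
        }
    # Unknown or missing origin: send no CORS headers, so browsers block it.
    return {}
```

Echoing a single allowed origin, rather than sending `*`, keeps read access limited to the org chart site.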

Dependencies

Directory

Known Problems

Facebook scraping

While it is possible to scrape Facebook user data through the API provided in its developer tools, doing so is more complicated than scraping a public-facing website. An API key is required and, from what I could tell, it expires frequently. There is probably a developer tool that allows a "set it and forget it" approach, but I had not found one by the time I left the job.

This is why we do not track hourly Facebook data in Hourly Follower Count. Facebook collection has also been turned off in the DOE Social Media Org Chart script, but it can be reinstated whenever someone works out a longer-lasting option.
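One way to keep an expiring key from breaking a run is to decide up front whether Facebook should be included at all. The sketch below is only an illustration: `platforms_to_scrape`, its parameters, and the five-minute margin are hypothetical and are not taken from the scripts in this repo.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Platforms named in this README that are scraped from public pages.
PUBLIC_PLATFORMS = ["twitter", "instagram", "youtube"]


def platforms_to_scrape(
    fb_key_expires_at: Optional[datetime],
    now: datetime,
    margin: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Include Facebook only if a key exists and outlives `margin`."""
    platforms = list(PUBLIC_PLATFORMS)
    # No key, or a key about to expire: skip Facebook instead of failing.
    if fb_key_expires_at is not None and fb_key_expires_at - now > margin:
        platforms.append("facebook")
    return platforms
```

With no key on hand, the call returns only the three public platforms, which matches the current behavior of leaving Facebook turned off.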

Error Handling

At this time there is no elegant way to notice when something is broken. A few fail-safes are built in, but they could be much improved.
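A possible direction for better fail-safes, sketched in Python under stated assumptions: run each platform's scraper separately, catch failures, and flag values that look implausible next to the last known count. The name `collect_counts`, the `scrapers` mapping, and the 50% drop threshold are all hypothetical and do not describe the fail-safes currently in the scripts.

```python
from typing import Callable, Dict, List, Tuple


def collect_counts(
    scrapers: Dict[str, Callable[[], int]],
    previous: Dict[str, int],
    max_drop: float = 0.5,
) -> Tuple[Dict[str, int], List[str]]:
    """Run every scraper; return (accepted counts, readable problem list)."""
    counts: Dict[str, int] = {}
    problems: List[str] = []
    for name, scrape in scrapers.items():
        try:
            value = scrape()
        except Exception as exc:  # one broken platform must not stop the rest
            problems.append(f"{name}: scrape failed ({exc})")
            continue
        if not isinstance(value, int) or value < 0:
            problems.append(f"{name}: implausible value {value!r}")
            continue
        last = previous.get(name)
        # A sudden large drop usually means the page layout changed.
        if last and value < last * (1 - max_drop):
            problems.append(f"{name}: dropped from {last} to {value}")
            continue
        counts[name] = value
    return counts, problems
```

If the returned problem list is non-empty, the script could print it and exit with a non-zero status, which Jenkins reports as a failed build, so breakage gets noticed instead of silently producing bad data.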

[This ticket outlines what is needed to scrape Facebook with the API key]().

To Do List

Wish List