Closed schlosser closed 4 years ago
Need to do some lightweight scraping of the HTML for various geographic levels:
1. Start with the list of states and get each state's `geoid`: https://api.kevalaanalytics.com/geography/states/
2. For each state, get the list of geoids for all of the counties, senate districts, house districts, and congressional districts in the state: http://assessor.keva.la/cleanenergyprogress/geographies?state=56&type=XXX where `XXX` is one of: `counties`, `legislativedistrictsupper`, `legislativedistrictslower`, or `congressionaldistricts`.
3. For each entity, scrape the HTML and pull data: http://assessor.keva.la/cleanenergyprogress/analytics?area_type=XXX&area_id=YYY where `XXX` is a type (see above) and `YYY` is a `geoid`. Example query. For each entity, pull the following data:
   - `countSolarJobs`: Number of Solar jobs
   - `countWindJobs`: Number of Wind jobs (State level only)
   - `countEnergyJobs`: Number of Energy efficiency jobs
   - `totalJobs`: Total jobs
   - `percentOfStateJobs`: Percent of state total (Non-State level only)
   - `residentialMWhInvested`: MWh Investment in Residential
   - `commercialMWhInvested`: MWh Investment in Commercial
   - `utilityMWhInvested`: MWh Investment in Utility
   - `totalMWhInvested`: MWh Investment total
   - `residentialDollarsInvested`: $USD Investment in Residential
   - `commercialDollarsInvested`: $USD Investment in Commercial
   - `utilityDollarsInvested`: $USD Investment in Utility
   - `totalDollarsInvested`: $USD Investment total
   - `investmentHomesEquivalent`: Number of equivalent homes total
   - `countResidentialInstallations`: Number of installations in Residential
   - `countCommercialInstallations`: Number of installations in Commercial
   - `countUtilityInstallations`: Number of installations in Utility
   - `countTotalInstallations`: Number of installations total
   - `residentialMWCapacity`: MW capacity in Residential
   - `commercialMWCapacity`: MW capacity in Commercial
   - `utilityMWCapacity`: MW capacity in Utility
   - `totalMWCapacity`: MW total
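The three-step plan above can be sketched as plain URL builders. This is a minimal sketch: the endpoint URLs and type names come from this issue, but the actual fetching and HTML parsing (e.g. with `requests` + `BeautifulSoup`) are omitted, and the helper names are hypothetical.

```python
# Endpoints from the issue description.
STATES_URL = "https://api.kevalaanalytics.com/geography/states/"
GEOGRAPHIES_URL = "http://assessor.keva.la/cleanenergyprogress/geographies"
ANALYTICS_URL = "http://assessor.keva.la/cleanenergyprogress/analytics"

# The four non-state geography types (step 2).
GEO_TYPES = [
    "counties",
    "legislativedistrictsupper",
    "legislativedistrictslower",
    "congressionaldistricts",
]


def geographies_url(state_geoid, geo_type):
    """Step 2: URL listing the geoids of one geography type in a state."""
    return f"{GEOGRAPHIES_URL}?state={state_geoid}&type={geo_type}"


def analytics_url(geo_type, geoid):
    """Step 3: URL of the HTML page carrying one entity's stats."""
    return f"{ANALYTICS_URL}?area_type={geo_type}&area_id={geoid}"
```

A scraper would loop: fetch `STATES_URL`, then for each state and each entry in `GEO_TYPES` fetch `geographies_url(...)`, then fetch and parse `analytics_url(...)` for every geoid found.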
Output format should be a folder of CSV files (named like `MA.csv`, `VA.csv`), one per state. In each file, include the following columns:
- `geoType`: One of `County`, `State Senate`, `State House`, `Congressional`, `State`
- `name`: County name, or N/A
- `number`: District number, or N/A
- `sourceURL`: HTML URL scraped, like this...
- All of the data above.
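As a sketch of the output step, the column list above maps directly onto the stdlib `csv.DictWriter`. The field names are from this issue; `write_state_csv` is a hypothetical helper, not an agreed interface.

```python
import csv
import io

# Per-entity data fields from step 3 of the issue.
DATA_FIELDS = [
    "countSolarJobs", "countWindJobs", "countEnergyJobs", "totalJobs",
    "percentOfStateJobs",
    "residentialMWhInvested", "commercialMWhInvested",
    "utilityMWhInvested", "totalMWhInvested",
    "residentialDollarsInvested", "commercialDollarsInvested",
    "utilityDollarsInvested", "totalDollarsInvested",
    "investmentHomesEquivalent",
    "countResidentialInstallations", "countCommercialInstallations",
    "countUtilityInstallations", "countTotalInstallations",
    "residentialMWCapacity", "commercialMWCapacity",
    "utilityMWCapacity", "totalMWCapacity",
]

# Metadata columns first, then all of the data above.
COLUMNS = ["geoType", "name", "number", "sourceURL"] + DATA_FIELDS


def write_state_csv(fileobj, rows):
    """Write one state's rows (dicts keyed by COLUMNS) as CSV.

    Missing values (e.g. countWindJobs below the state level) are
    filled with "N/A" via restval.
    """
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS, restval="N/A")
    writer.writeheader()
    writer.writerows(rows)
```

For example, `write_state_csv(open("MA.csv", "w", newline=""), rows)` would produce one of the ~50 per-state files.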
We'll add those ~50 CSV files to `data/cleaned/jobs/`. Then we'll write scripts to inject that data into the SQL database, but it will be easier to have all the data scraped and cleaned first.
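The later injection step could look something like the sketch below. The issue doesn't name the database engine or schema, so this uses `sqlite3` as a stand-in with a flat table mirroring a few of the CSV columns; table and function names are hypothetical.

```python
import sqlite3

def load_rows(conn, rows):
    """Load scraped rows into a flat table (subset of columns shown).

    A real script would read the per-state CSVs from data/cleaned/jobs/
    and cover every column; this only demonstrates the shape.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clean_energy_stats (
               geoType TEXT, name TEXT, number TEXT, sourceURL TEXT,
               countSolarJobs INTEGER, totalJobs INTEGER
           )"""
    )
    conn.executemany(
        "INSERT INTO clean_energy_stats VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```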
Can you specify what to scrape?
Ahh, realizing this is a bit different than I thought. Will update the description.
Updated, please see above!
Accidental close!
This is done!