Closed schlosser closed 4 years ago
Need to do some lightweight scraping of the HTML for various geographic levels:
1. Start with the list of states and get each state's `geoid`: https://api.kevalaanalytics.com/geography/states/
2. For each state, get the list of geoids for all of the counties, senate districts, house districts, and congressional districts in the state: http://assessor.keva.la/cleanenergyprogress/geographies?state=56&type=XXX where `XXX` is one of: `counties`, `legislativedistrictsupper`, `legislativedistrictslower`, or `congressionaldistricts`.
3. For each entity, scrape the HTML and pull data: http://assessor.keva.la/cleanenergyprogress/analytics?area_type=XXX&area_id=YYY where `XXX` is a type (see above) and `YYY` is a `geoid`. Example query. For each entity, pull the following data:
   - `countSolarJobs`: Number of Solar jobs
   - `countWindJobs`: Number of Wind jobs (State level only)
   - `countEnergyJobs`: Number of Energy efficiency jobs
   - `totalJobs`: Total jobs
   - `percentOfStateJobs`: Percent of state total (Non-State level only)
   - `residentialMWhInvested`: MWh Investment in Residential
   - `commercialMWhInvested`: MWh Investment in Commercial
   - `utilityMWhInvested`: MWh Investment in Utility
   - `totalMWhInvested`: MWh Investment total
   - `residentialDollarsInvested`: $USD Investment in Residential
   - `commercialDollarsInvested`: $USD Investment in Commercial
   - `utilityDollarsInvested`: $USD Investment in Utility
   - `totalDollarsInvested`: $USD Investment total
   - `investmentHomesEquivalent`: Number of equivalent homes total
   - `countResidentialInstallations`: Number of installations in Residential
   - `countCommercialInstallations`: Number of installations in Commercial
   - `countUtilityInstallations`: Number of installations in Utility
   - `countTotalInstallations`: Number of installations total
   - `residentialMWCapacity`: MW capacity in Residential
   - `commercialMWCapacity`: MW capacity in Commercial
   - `utilityMWCapacity`: MW capacity in Utility
   - `totalMWCapacity`: MW total
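The three-step plan above can be sketched as plain URL builders. This is a minimal sketch: the endpoint URLs and type names come from this issue, but the actual fetching and HTML parsing (e.g. with `requests` + `BeautifulSoup`) are omitted, and the helper names are hypothetical.

```python
# Endpoints from the issue description.
STATES_URL = "https://api.kevalaanalytics.com/geography/states/"
GEOGRAPHIES_URL = "http://assessor.keva.la/cleanenergyprogress/geographies"
ANALYTICS_URL = "http://assessor.keva.la/cleanenergyprogress/analytics"

# The four non-state geography types (step 2).
GEO_TYPES = [
    "counties",
    "legislativedistrictsupper",
    "legislativedistrictslower",
    "congressionaldistricts",
]


def geographies_url(state_geoid, geo_type):
    """Step 2: URL listing the geoids of one geography type in a state."""
    return f"{GEOGRAPHIES_URL}?state={state_geoid}&type={geo_type}"


def analytics_url(geo_type, geoid):
    """Step 3: URL of the HTML page carrying one entity's stats."""
    return f"{ANALYTICS_URL}?area_type={geo_type}&area_id={geoid}"
```

A scraper would loop: fetch `STATES_URL`, then for each state and each entry in `GEO_TYPES` fetch `geographies_url(...)`, then fetch and parse `analytics_url(...)` for every geoid found.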
Output format should be a folder of CSV files (named like `MA.csv`, `VA.csv`), one per state. In each file, include the following columns:
- `geoType`: One of `County`, `State Senate`, `State House`, `Congressional`, `State`
- `name`: County name, or N/A
- `number`: District number, or N/A
- `sourceURL`: HTML URL scraped, like this...
- All of the data above.
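As a sketch of the output step, the column list above maps directly onto the stdlib `csv.DictWriter`. The field names are from this issue; `write_state_csv` is a hypothetical helper, not an agreed interface.

```python
import csv
import io

# Per-entity data fields from step 3 of the issue.
DATA_FIELDS = [
    "countSolarJobs", "countWindJobs", "countEnergyJobs", "totalJobs",
    "percentOfStateJobs",
    "residentialMWhInvested", "commercialMWhInvested",
    "utilityMWhInvested", "totalMWhInvested",
    "residentialDollarsInvested", "commercialDollarsInvested",
    "utilityDollarsInvested", "totalDollarsInvested",
    "investmentHomesEquivalent",
    "countResidentialInstallations", "countCommercialInstallations",
    "countUtilityInstallations", "countTotalInstallations",
    "residentialMWCapacity", "commercialMWCapacity",
    "utilityMWCapacity", "totalMWCapacity",
]

# Metadata columns first, then all of the data above.
COLUMNS = ["geoType", "name", "number", "sourceURL"] + DATA_FIELDS


def write_state_csv(fileobj, rows):
    """Write one state's rows (dicts keyed by COLUMNS) as CSV.

    Missing values (e.g. countWindJobs below the state level) are
    filled with "N/A" via restval.
    """
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS, restval="N/A")
    writer.writeheader()
    writer.writerows(rows)
```

For example, `write_state_csv(open("MA.csv", "w", newline=""), rows)` would produce one of the ~50 per-state files.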
We'll add those ~50 CSV files to `data/cleaned/jobs/`. Then we'll write scripts to inject that data into the SQL database, but it will be easier to have all the data scraped and cleaned first.
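The later injection step could look something like the sketch below. The issue doesn't name the database engine or schema, so this uses `sqlite3` as a stand-in with a flat table mirroring a few of the CSV columns; table and function names are hypothetical.

```python
import sqlite3

def load_rows(conn, rows):
    """Load scraped rows into a flat table (subset of columns shown).

    A real script would read the per-state CSVs from data/cleaned/jobs/
    and cover every column; this only demonstrates the shape.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clean_energy_stats (
               geoType TEXT, name TEXT, number TEXT, sourceURL TEXT,
               countSolarJobs INTEGER, totalJobs INTEGER
           )"""
    )
    conn.executemany(
        "INSERT INTO clean_energy_stats VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```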
Can you specify what to scrape?
Ahh, realizing this is a bit different than I thought. Will update the description.
Updated, please see above!
Accidental close!
This is done!