econ-dashboard / econ-dashboard.github.io

Begin Data Pipeline #26

Closed miguelito34 closed 3 years ago

miguelito34 commented 3 years ago

For the sources above, the aim is to aggregate the data and output one CSV per county and metric, with each file containing all of the historical data. The idea is to have one CSV file per chart per county. Each file would hold the longitudinal data (yearly or quarterly), which vega-lite would then load and embed in the HTML page.

The data pipeline will live in the /scripts/data-wrangling folder, and the output CSVs should live in the /site/county-data folder.
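As a rough sketch of the "one CSV per chart" idea, a Vega-Lite spec can point its data source at a single CSV file by URL. The helper name, the CSV path, and the column names (`period`, `value`) below are assumptions for illustration, not something settled in this issue.

```python
import json


def chart_spec(csv_url: str, x_field: str = "period", y_field: str = "value") -> dict:
    """Build a minimal Vega-Lite line-chart spec that loads one CSV by URL."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"url": csv_url},
        "mark": "line",
        "encoding": {
            # Field names are assumed; they must match the CSV header.
            "x": {"field": x_field, "type": "temporal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


# Hypothetical file name for one county/metric pair.
spec = chart_spec("county-data/example-county-example-metric.csv")
print(json.dumps(spec, indent=2))
```

The printed JSON could be embedded in a page template; how the spec is actually rendered is left open here.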

miguelito34 commented 3 years ago

One consideration here is how much data we will have. As we scale, I can imagine having GBs of data, and it may be worth thinking through how and where to store it. In relative terms that is not a ton of data, but it may outgrow easy use on GitHub.

miguelito34 commented 3 years ago

The best way to download the QCEW data will likely be through URLs. This should be fairly straightforward and scriptable. The only downside is that we can only get data back to 2016 using this process.

In general, we can download QCEW CSVs for a given year and quarter, for all counties, using a URL structure like the one below:

http://www.bls.gov/cew/data/api/{year}/{quarter}/industry/10.csv

Here, {year} is any year in the last five and {quarter} is 1-4 (following FY quarters).
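A minimal sketch of how the URLs could be generated from that template. The function name and the `n_years` parameter are made up for illustration, and nothing is downloaded here; the list would be handed to whatever download step the pipeline ends up using.

```python
QCEW_URL = "http://www.bls.gov/cew/data/api/{year}/{quarter}/industry/10.csv"


def qcew_urls(last_year: int, n_years: int = 5) -> list[str]:
    """Return one URL per (year, quarter) for the n_years ending at last_year."""
    urls = []
    for year in range(last_year - n_years + 1, last_year + 1):
        for quarter in range(1, 5):  # quarters 1-4
            urls.append(QCEW_URL.format(year=year, quarter=quarter))
    return urls


# Example: the five years 2016-2020, four quarters each.
urls = qcew_urls(2020)
print(len(urls), urls[0], urls[-1])
```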

LAU data can be downloaded in a similar way:

https://www.bls.gov/lau/laucnty{zero padded decimal year}.txt
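A small sketch for building the LAU file URLs, assuming "zero padded decimal year" means the two-digit year without century (e.g. 05 for 2005). That reading and the function name are assumptions and should be checked against the actual file listing.

```python
LAU_URL = "https://www.bls.gov/lau/laucnty{yy}.txt"


def lau_url(year: int) -> str:
    """Build the LAU county file URL, assuming a two-digit, zero-padded year."""
    return LAU_URL.format(yy=f"{year % 100:02d}")


# Example: one URL per year over an illustrative range.
lau_urls = [lau_url(y) for y in range(1990, 2021)]
print(lau_urls[0], lau_urls[-1])
```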

miguelito34 commented 3 years ago

Per discussion on 5/3 with @rsyoh-97, will go ahead with plan outlined above.

Will pull all available QCEW data from the last 5 years using the provided URL. From the QCEW data, the most relevant metrics are establishments and avg. weekly wage.
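A rough sketch of trimming a downloaded QCEW file down to those two metrics. The column names (`area_fips`, `qtrly_estabs`, `avg_wkly_wage`, etc.) are assumed and should be checked against the header of a real file; the sample rows are made-up placeholders, not real figures.

```python
import csv
import io

# Assumed column names; verify against the header of a downloaded file.
KEEP = ["area_fips", "year", "qtr", "qtrly_estabs", "avg_wkly_wage"]

# Made-up placeholder rows standing in for a downloaded QCEW CSV.
SAMPLE = """area_fips,own_code,year,qtr,qtrly_estabs,avg_wkly_wage
06075,0,2020,1,100,2500
06001,0,2020,1,200,1800
"""


def select_metrics(text: str) -> list[dict]:
    """Keep only the identifier, period, and metric columns from QCEW CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: row[col] for col in KEEP} for row in reader]


rows = select_metrics(SAMPLE)
print(rows[0])
```

This does no filtering by ownership or aggregation level; any such filtering would be a separate step.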

From the LAU data, will pull all available data (going back to 1990).

Re: data structure - within /site/county-data/, will create a folder for each county, and the relevant CSVs will live within each county folder. From there, it should be easy for the templating/analysis scripts to pull in the relevant data for a given county's page.

miguelito34 commented 3 years ago

To keep track of the data most easily (county names can be similar from state to state), will try to use the FIPS codes. Thus, it may be best to have a single folder within /site/county-data/ for each county, and use the FIPS code as the name of the folder.
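A sketch of what the FIPS-named folder layout could look like in the pipeline. County FIPS codes are five digits (two for the state, three for the county), so the code is zero-padded in case it was read in as a number. The function names, the metric file name, and the placeholder row are hypothetical, and a temporary directory stands in for /site/county-data/.

```python
import csv
import tempfile
from pathlib import Path


def county_dir(root: Path, fips) -> Path:
    """Folder for one county, named by its 5-digit, zero-padded FIPS code."""
    return root / str(fips).zfill(5)


def write_metric(root: Path, fips, metric: str, rows: list[dict]) -> Path:
    """Write one metric's rows to <root>/<fips>/<metric>.csv and return the path."""
    folder = county_dir(root, fips)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{metric}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Demo in a temporary directory with a placeholder row (not real data).
with tempfile.TemporaryDirectory() as tmp:
    out = write_metric(Path(tmp) / "county-data", 6075, "avg_weekly_wage",
                       [{"year": 2020, "qtr": 1, "value": 0}])
    relative = out.relative_to(tmp).as_posix()
    written = out.read_text()

print(relative)
```

Naming folders by FIPS code rather than county name avoids collisions between same-named counties in different states.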

miguelito34 commented 3 years ago

For LAU data, we're currently pulling annual averages, but monthly data for the last 14 months is also available here if we want to consider that as well.