NSAPH-Data-Processing / us_census_zcta_time_series

Using census package to extract acs5 data
MIT License

Table of Contents:

Introduction

The project streamlines the extraction and analysis of demographic data from the American Community Survey 5-Year Data (ACS5). It aims to provide cleaned data for the required variables for each year from 2009 to 2019.

Notes about American Community Survey 5-Year Data:

Data Description

Codebook

Variable Name: pct_blk
Description: % of the population listed as Black
Derivation: B02001_003E / B02001_001E
    B02001_003E: Estimate!!Total!!Black or African American alone
    B02001_001E: Estimate!!Total - Race

Variable Name: medhouseholdincome
Description: median household income
Derivation: B19013_001E
    B19013_001E: Median Household Income In The Past 12 Months In 2011 Inflation-Adjusted Dollars

Variable Name: pct_owner_occ
Description: % of housing units occupied by their owner
Derivation, years 2009 - 2014: (B11012_004E + B11012_008E + B11012_011E + B11012_014E) / B11012_001E
    B11012_004E: Estimate!!Total!!Family households!!Married-couple family!!Owner-occupied housing units
    B11012_008E: Estimate!!Total!!Family households!!Other family!!Male householder, no wife present!!Owner-occupied housing units
    B11012_011E: Estimate!!Total!!Family households!!Other family!!Female householder, no husband present!!Owner-occupied housing units
    B11012_014E: Estimate!!Total!!Nonfamily households!!Owner-occupied housing units
    B11012_001E: Estimate!!Total - HOUSEHOLD TYPE BY TENURE
Derivation, years 2015 - 2018: B25011_002E / B25011_001E
    B25011_002E: Estimate!!Total!!Owner occupied
    B25011_001E: Estimate!!Total - TENURE BY HOUSEHOLD TYPE (INCLUDING LIVING ALONE) AND AGE OF HOUSEHOLDER

Variable Name: hispanic
Description: % of the population identified as Hispanic, regardless of reported race
Derivation: B03003_003E / B03003_001E
    B03003_003E: Estimate!!Total!!Hispanic or Latino
    B03003_001E: Estimate!!Total

Variable Name: education
Description: % of the population older than 65 not graduating from high school
Derivation: (B15001_036E + B15001_037E + B15001_077E + B15001_078E) / (B15001_035E + B15001_076E)
    B15001_036E: Estimate!!Total!!Male!!65 years and over!!Less than 9th grade
    B15001_037E: Estimate!!Total!!Male!!65 years and over!!9th to 12th grade, no diploma
    B15001_077E: Estimate!!Total!!Female!!65 years and over!!Less than 9th grade
    B15001_078E: Estimate!!Total!!Female!!65 years and over!!9th to 12th grade, no diploma
    B15001_035E: Estimate!!Total!!Male!!65 years and over
    B15001_076E: Estimate!!Total!!Female!!65 years and over
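
The derivations in the codebook can be sketched in plain Python as follows. This is a minimal illustration, not the code in census_zcta.py: the helper names safe_ratio and derive_variables, the dict-based input, and the choice to return None when a denominator is zero or an estimate is missing are all assumptions. The sketch returns fractions between 0 and 1; whether the project scales them to 0-100 is not stated here. It also assumes that year is expressed in the same terms as the year ranges listed for pct_owner_occ.

```python
from typing import Optional


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None if a value is missing or the denominator is 0."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def derive_variables(raw: dict, year: int) -> dict:
    """Compute the codebook variables for one ZCTA (hypothetical helper).

    `raw` maps ACS variable codes (e.g. "B02001_001E") to numeric estimates.
    """

    def total(*codes: str) -> Optional[float]:
        # Sum several estimates; the sum is None if any of them is missing.
        values = [raw.get(code) for code in codes]
        if any(value is None for value in values):
            return None
        return sum(values)

    # pct_owner_occ uses a different source table depending on the year.
    if 2009 <= year <= 2014:
        owner_occ = safe_ratio(
            total("B11012_004E", "B11012_008E", "B11012_011E", "B11012_014E"),
            raw.get("B11012_001E"),
        )
    elif 2015 <= year <= 2018:
        owner_occ = safe_ratio(raw.get("B25011_002E"), raw.get("B25011_001E"))
    else:
        # The codebook lists no derivation outside 2009-2018.
        raise ValueError(f"no pct_owner_occ derivation documented for year {year}")

    return {
        "pct_blk": safe_ratio(raw.get("B02001_003E"), raw.get("B02001_001E")),
        "medhouseholdincome": raw.get("B19013_001E"),
        "pct_owner_occ": owner_occ,
        "hispanic": safe_ratio(raw.get("B03003_003E"), raw.get("B03003_001E")),
        "education": safe_ratio(
            total("B15001_036E", "B15001_037E", "B15001_077E", "B15001_078E"),
            total("B15001_035E", "B15001_076E"),
        ),
    }
```

Returning None for an undefined ratio is one plausible way for null values to arise in ZCTAs with a zero denominator, but that link to the actual pipeline is an assumption.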

Data Quality

The table below summarizes the missing values in the generated dataset. For each variable, it lists the total number of ZCTAs and the number of null values in each year.

variable_name total_zcta 2011_null 2012_null 2013_null 2014_null 2015_null 2016_null 2017_null 2018_null
pct_blk 33120 369 337 336 306 310 321 317 321
medhouseholdincome 33120 0 0 0 0 0 0 0 0
pct_owner_occ 33120 615 589 596 573 571 580 573 578
hispanic 33120 369 337 336 306 310 321 317 321
education 33120 1199 1138 1103 1034 1024 1011 1046 1019
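
A null-count summary of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the project's code: the function name null_counts, the record layout (one dict per ZCTA and year with a "year" key), and the treatment of both None and NaN as null are all hypothetical.

```python
import math

VARIABLES = ["pct_blk", "medhouseholdincome", "pct_owner_occ", "hispanic", "education"]


def is_null(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_counts(records) -> dict:
    """Count missing values per variable and year.

    `records` is an iterable of dicts, each with a "year" key and one key
    per variable. Returns {variable: {year: number_of_nulls}}.
    """
    counts = {variable: {} for variable in VARIABLES}
    for record in records:
        year = record["year"]
        for variable in VARIABLES:
            # Make sure every observed year appears, even with zero nulls.
            counts[variable].setdefault(year, 0)
            if is_null(record.get(variable)):
                counts[variable][year] += 1
    return counts
```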

Repository Content

The repository contains:

Data Lineage

Processing Rules

Processing rules applied in census_zcta.py

ACS 5-year estimates aggregate data collected over a 5-year period, so the project applies a specific processing rule: each dataset generated from ACS data is internally tagged with the year 2 years before the ACS release year, which is the midpoint of the collection period. Conversely, the data tagged with a given year comes from the ACS 5-year release dated 2 years later.

For instance, data extracted from the 2020 ACS 5-year release is tagged internally as ACS 2018, because the 2020 5-year estimates cover ACS data collected from January 1, 2016, through December 31, 2020. Tagging by the midpoint keeps the analysis consistent with the temporal aggregation inherent in ACS data reporting.
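
The year-tagging rule amounts to a fixed offset, which can be sketched as below. The constant and function names are illustrative assumptions and do not come from census_zcta.py.

```python
from typing import Tuple

ACS5_SPAN_YEARS = 5   # a 5-year release covers five calendar years
TAG_OFFSET_YEARS = 2  # datasets are tagged 2 years before the release year


def tagged_year(acs_release_year: int) -> int:
    """Internal tag for a dataset taken from a given ACS 5-year release."""
    return acs_release_year - TAG_OFFSET_YEARS


def acs_release_year(tag: int) -> int:
    """ACS 5-year release that supplies the data for a given internal tag."""
    return tag + TAG_OFFSET_YEARS


def collection_window(acs_release_year_: int) -> Tuple[int, int]:
    """First and last calendar year covered by a 5-year release."""
    return acs_release_year_ - (ACS5_SPAN_YEARS - 1), acs_release_year_
```

With these definitions, tagged_year(2020) gives 2018 and collection_window(2020) gives (2016, 2020), matching the example above.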

Run

(I) Clone the repository

Clone the repository and change into its directory:

git clone https://github.com/<user>/<repo>
cd <repo>

(II) Create Conda Environment

Create the conda environment using the requirements.yml file:

conda env create -f requirements.yml
conda activate <env_name> #environment name as found in requirements.yml

It is also possible to use mamba.

mamba env create -f requirements.yml
mamba activate <env_name>

(III) Create entrypoints

Add symlinks to the input, intermediate and output folders inside the corresponding /data subfolders.

For example:

export HOME_DIR=$(pwd)

cd $HOME_DIR/data/input/
ln -s <input_path> . #paths as found in data/input/README.md if any

cd $HOME_DIR/data/output/
ln -s <output_path> . #paths as found in data/output/README.md if any

The README.md files inside the /data subfolders contain path documentation for NSAPH internal purposes.

(IV) Run pipeline

Run the script for a single year (repeat for each year needed):

python ./<main_script>.py --year <year>

or run the pipeline:

snakemake --cores