The project streamlines the extraction and analysis of demographic data from the American Community Survey 5-Year Data (ACS5). It aims to provide cleaned data for the required variables for each year from 2009 to 2019.
Notes about American Community Survey 5-Year Data:
The American Community Survey (ACS) is an ongoing survey that provides data every year -- giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population.
ACS data is available from 2009 onwards; however, ZCTA (ZIP Code Tabulation Area) level information is available from 2011 onwards.
The ACS is an ongoing survey that collects responses every day of the year. The ACS estimates do not represent a specific point in time during the collection period, but rather a pooling of the data collected during the entire period. For 1-year estimates, the ACS uses data collected in that calendar year — January 1 through December 31. Similarly, the 5-year estimates use data collected over a 5-year period. For example, the 2016-2020 5-year estimates will use ACS data collected from January 1, 2016, through December 31, 2020. Read further about the period estimates here.
Variable Name | Description | Derivation |
---|---|---|
pct_blk | % of the population listed as black | B02001_003E/B02001_001E B02001_003E: Estimate!!Total!!Black or African American alone B02001_001E: Estimate!!Total - Race |
medhouseholdincome | median household income | B19013_001E B19013_001E: Median Household Income In The Past 12 Months In 2011 Inflation-Adjusted Dollars |
pct_owner_occ | % of housing units occupied by their owner | For Years 2009 - 2014: (B11012_004E + B11012_008E + B11012_011E + B11012_014E)/ B11012_001E B11012_004E: Estimate!!Total!!Family households!!Married-couple family!!Owner-occupied housing units B11012_008E: Estimate!!Total!!Family households!!Other family!!Male householder, no wife present!!Owner-occupied housing units B11012_011E: Estimate!!Total!!Family households!!Other family!!Female householder, no husband present!!Owner-occupied housing units B11012_014E: Estimate!!Total!!Nonfamily households!!Owner-occupied housing units B11012_001E: Estimate!!Total - HOUSEHOLD TYPE BY TENURE For Years 2015 - 2018: B25011_002E/B25011_001E B25011_002E: Estimate!!Total!!Owner occupied B25011_001E: Estimate!!Total: TENURE BY HOUSEHOLD TYPE (INCLUDING LIVING ALONE) AND AGE OF HOUSEHOLDER |
hispanic | % of the population identified as Hispanic, regardless of reported race | B03003_003E/B03003_001E B03003_003E: Estimate!!Total!!Hispanic or Latino B03003_001E: Estimate!!Total |
education | % of the population older than 65 not graduating from high school | (B15001_036E+B15001_037E+B15001_077E+B15001_078E)/(B15001_035E+B15001_076E) B15001_036E: Estimate!!Total!!Male!!65 years and over!!Less than 9th grade B15001_037E: Estimate!!Total!!Male!!65 years and over!!9th to 12th grade, no diploma B15001_077E: Estimate!!Total!!Female!!65 years and over!!Less than 9th grade B15001_078E: Estimate!!Total!!Female!!65 years and over!!9th to 12th grade, no diploma B15001_035E: Estimate!!Total!!Male!!65 years and over B15001_076E: Estimate!!Total!!Female!!65 years and over |
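The derivations in the table can be sketched in plain Python as follows. This is a minimal illustration, not the code in the repository: the helper names (`safe_ratio`, `total`, `derive`) are hypothetical, the sample counts are made up, results are returned as fractions exactly as the formulas are written, and returning `None` for a zero or missing denominator is an assumption about how nulls arise. The `year` argument follows whichever year convention the table's "For Years" ranges use.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when it cannot be computed."""
    if numerator is None or denominator in (None, 0):
        return None
    return numerator / denominator


def total(row, codes):
    """Sum the listed ACS estimates; None if any of them is missing."""
    values = [row.get(code) for code in codes]
    if any(value is None for value in values):
        return None
    return sum(values)


def derive(row, year):
    """Apply the table's formulas to one ZCTA record (hypothetical helper)."""
    if year <= 2014:
        # 2009 - 2014: owner-occupied units come from table B11012.
        owner = total(row, ["B11012_004E", "B11012_008E",
                            "B11012_011E", "B11012_014E"])
        households = row.get("B11012_001E")
    else:
        # 2015 - 2018: owner-occupied units come from table B25011.
        owner = row.get("B25011_002E")
        households = row.get("B25011_001E")
    no_diploma = total(row, ["B15001_036E", "B15001_037E",
                             "B15001_077E", "B15001_078E"])
    over_65 = total(row, ["B15001_035E", "B15001_076E"])
    return {
        "pct_blk": safe_ratio(row.get("B02001_003E"), row.get("B02001_001E")),
        "medhouseholdincome": row.get("B19013_001E"),
        "pct_owner_occ": safe_ratio(owner, households),
        "hispanic": safe_ratio(row.get("B03003_003E"), row.get("B03003_001E")),
        "education": safe_ratio(no_diploma, over_65),
    }


# Made-up counts for a single ZCTA, for illustration only.
sample = {
    "B02001_001E": 1000, "B02001_003E": 250,
    "B19013_001E": 52000,
    "B25011_001E": 400, "B25011_002E": 300,
    "B03003_001E": 1000, "B03003_003E": 100,
    "B15001_035E": 200, "B15001_076E": 300,
    "B15001_036E": 10, "B15001_037E": 20,
    "B15001_077E": 30, "B15001_078E": 40,
}
derived = derive(sample, 2016)
```

With these sample counts, `derived["pct_blk"]` is 0.25 and `derived["education"]` is 0.2. A record whose totals are all zero yields `None` for every ratio under the assumption above.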
The table below summarizes the missing values in the generated dataset. It records the number of null values for each variable in each year.
variable_name | total_zcta | 2011_null | 2012_null | 2013_null | 2014_null | 2015_null | 2016_null | 2017_null | 2018_null |
---|---|---|---|---|---|---|---|---|---|
pct_blk | 33120 | 369 | 337 | 336 | 306 | 310 | 321 | 317 | 321 |
medhouseholdincome | 33120 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
pct_owner_occ | 33120 | 615 | 589 | 596 | 573 | 571 | 580 | 573 | 578 |
hispanic | 33120 | 369 | 337 | 336 | 306 | 310 | 321 | 317 | 321 |
education | 33120 | 1199 | 1138 | 1103 | 1034 | 1024 | 1011 | 1046 | 1019 |
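A summary of this shape could be produced along the following lines. This is a sketch under assumptions: the function name `null_counts` and the input layout (one list of per-ZCTA records per year, with `None` marking a missing value) are hypothetical, and the toy records are not real ACS values.

```python
def null_counts(datasets, variables):
    """Count missing values per variable and year.

    datasets maps a year to a list of per-ZCTA records (dicts);
    a value of None, or an absent key, counts as null.
    """
    summary = {}
    for variable in variables:
        summary[variable] = {
            f"{year}_null": sum(1 for row in rows if row.get(variable) is None)
            for year, rows in sorted(datasets.items())
        }
    return summary


# Toy input: three ZCTAs in each of two years.
toy = {
    2011: [{"pct_blk": 0.1}, {"pct_blk": None}, {"pct_blk": None}],
    2012: [{"pct_blk": 0.1}, {"pct_blk": 0.2}, {"pct_blk": None}],
}
counts = null_counts(toy, ["pct_blk"])
```

For the toy input, `counts["pct_blk"]` holds 2 nulls for 2011 and 1 for 2012.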
The repository contains:
Data Source: The primary data source for this project is the American Community Survey 5-Year Data (ACS5), which is publicly available and maintained by the U.S. Census Bureau. The ACS5 data provides a comprehensive snapshot of various demographic variables.
Extraction: We use a Python wrapper to extract data from the U.S. Census Bureau's API. This wrapper gives direct access to the ACS and SF1 datasets, so the specific variables needed for analysis and processing can be retrieved quickly.
Processing & Final Dataset: We transform the subset of variables obtained from the API and generate the final dataset for each year.
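To make the extraction step concrete, the sketch below builds a raw request URL instead of going through the Python wrapper the project uses. The endpoint layout (`https://api.census.gov/data/<year>/acs/acs5`) and the `zip code tabulation area` geography name are assumptions based on the public Census Data API and are not taken from this repository. The function name is hypothetical, and some years or request volumes may need extra parameters such as an API key.

```python
from urllib.parse import quote, urlencode

# Assumed base URL of the public Census Data API (not taken from this repository).
BASE_URL = "https://api.census.gov/data"


def build_acs5_zcta_url(year, variables):
    """Return an ACS5 query URL for all ZCTAs in the given ACS release year."""
    params = {
        "get": ",".join(variables),
        "for": "zip code tabulation area:*",
    }
    # Leave ',', ':' and '*' unescaped so the query string stays readable.
    query = urlencode(params, safe=",:*", quote_via=quote)
    return f"{BASE_URL}/{year}/acs/acs5?{query}"


url = build_acs5_zcta_url(2018, ["B02001_001E", "B02001_003E"])
print(url)
```

The resulting URL can be passed to any HTTP client. Building it separately keeps the variable list and the year easy to inspect before any request is made.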
Processing rules applied in census_zcta.py
Because each ACS 5-year estimate pools data collected over a 5-year period, the project applies a specific processing rule: every dataset is internally tagged with the year 2 years before the ACS release year, which is the middle year of the collection period. Equivalently, the dataset tagged with a given year is built from the ACS 5-year estimates whose collection period ends 2 years later.
For instance, data extracted from the 2020 ACS 5-year estimates is tagged internally as ACS 2018, because those estimates cover data collected from January 1, 2016, through December 31, 2020. This keeps the year label consistent with the temporal aggregation inherent in ACS reporting.
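The tagging rule amounts to a fixed offset, as the following sketch shows. The names `tagged_year` and `collection_window` are hypothetical and are not taken from census_zcta.py.

```python
ACS5_WINDOW_YEARS = 5  # each 5-year estimate pools five calendar years
TAG_OFFSET = 2         # the tag is 2 years before the ACS release year


def tagged_year(acs_release_year):
    """Internal tag for a dataset extracted from the given ACS 5-year release."""
    return acs_release_year - TAG_OFFSET


def collection_window(acs_release_year):
    """First and last calendar year pooled into the given 5-year release."""
    first_year = acs_release_year - ACS5_WINDOW_YEARS + 1
    return first_year, acs_release_year


first, last = collection_window(2020)
tag = tagged_year(2020)
```

For the 2020 release, the window is 2016 to 2020 and the tag is 2018, the middle year of that window.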
(I) Clone the repository
git clone https://github.com/<user>/<repo>
cd <repo>
(II) Create Conda Environment
Create the conda environment using the requirements.yml file:
conda env create -f requirements.yml
conda activate <env_name> #environment name as found in requirements.yml
It is also possible to use mamba:
mamba env create -f requirements.yml
mamba activate <env_name>
(III) Create entrypoints
Add symlinks to the input, intermediate and output folders inside the corresponding /data subfolders.
For example:
export HOME_DIR=$(pwd)
cd $HOME_DIR/data/input/
ln -s <input_path> . #paths as found in data/input/README.md if any
cd $HOME_DIR/data/output/
ln -s <output_path> . #paths as found in data/output/README.md if any
The README.md files inside the /data subfolders contain path documentation for NSAPH internal purposes.
(IV) Run pipeline
Run the script for a given year (repeat for each year needed):
python ./<main_script>.py --year <year>
or run the pipeline:
snakemake --cores