Zip code to county master crosswalk

This pipeline pulls crosswalks from the U.S. Department of Housing and Urban Development (HUD) database, compiling a comprehensive ZIP code to county crosswalk from 2010 to 2023.

ZIP Codes are updated on a regular basis. Here is an example announcement from the USPS.

In order to run the pipeline, build a conda environment with the following command.

conda env create -f requirements.yaml
conda activate zip2county_master_xwalk

mamba can be used in place of conda with the same commands.

You need an API token for the HUD database to use the pipeline. Instructions on how to quickly and freely obtain an API token can be found at this link. Make sure to export the API token as an environment variable.

export HUD_API_TOKEN="your-token-here"
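Before launching the pipeline, it can help to confirm that the token is actually visible to Python. The following is a minimal sketch and not part of the pipeline; the helper name get_hud_token is hypothetical, and only the variable name HUD_API_TOKEN comes from the instructions above.

```python
import os


def get_hud_token(environ=os.environ):
    """Return the HUD API token, or raise a clear error if it is not set.

    Hypothetical helper for illustration only; the pipeline itself may
    read the variable differently.
    """
    token = environ.get("HUD_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            'HUD_API_TOKEN is not set; run: export HUD_API_TOKEN="your-token-here"'
        )
    return token
```

Calling get_hud_token() in the same shell session where the export command was run should return the token; a RuntimeError suggests the variable was set in a different session or not exported.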

Link entrypoints to data placeholders

Add symlinks to the input, intermediate, and output folders inside the corresponding /data subfolders by running:

python utils/create_data_paths.py

For nsaph users:

python utils/create_data_paths.py datapaths=cannon_datapaths

snakemake is the preferred way to run the pipeline. To run the pipeline with default parameters, simply run:

snakemake --cores 1

To modify any of the default parameters, modify the config.yaml file or pass the -C flag to snakemake followed by your desired parameters.

snakemake --cores 1 -C min_year={min_year} max_year={max_year} criteria={criteria} xwalk_method={xwalk_method}

Dockerized Pipeline

Create the folder where you would like to store the output dataset.

mkdir <output_path>

A multi-platform image is available as nsaph/zip2county_master_xwalk:latest. To run the Docker container, use the following command, where <image_name> is that image or one you built yourself:

docker run -v <output_path>/:/app/data/output --env HUD_API_TOKEN=$HUD_API_TOKEN <image_name>

If you also want to store the raw and intermediate data, mount the whole data folder instead:

docker run -v <output_path>:/app/data/ --env HUD_API_TOKEN=$HUD_API_TOKEN <image_name>

Default arguments can be overridden by appending the -C flag and the desired parameters:

docker run -v <output_path>:/app/data/ --env HUD_API_TOKEN=$HUD_API_TOKEN <image_name> -C min_year={min_year} max_year={max_year}

Building the image

To build your own Docker image, run:

docker build -t <image_name> .

For a multi-architecture build, run:

docker buildx build --platform linux/amd64,linux/arm64 -t nsaph/zip2county_master_xwalk:latest . --push

Public data

Output crosswalks for the default parameters and for several different xwalk_method values can be found on the Harvard Dataverse at https://doi.org/10.7910/DVN/0U2TCB. To cite with BibTeX, use:

@data{DVN/0U2TCB_2024,
author = {Kitch, James},
publisher = {Harvard Dataverse},
title = {{ZIP Code to County Crosswalk}},
year = {2024},
version = {DRAFT VERSION},
doi = {10.7910/DVN/0U2TCB},
url = {https://doi.org/10.7910/DVN/0U2TCB}
}

Data information

Crosswalks assist researchers in translating data between different geographic units. For instance, a researcher might have hospitalization data at the ZIP-code level but other relevant variables at the U.S. county level. If the analysis is to be conducted at the county level, it's crucial in most traditional study designs to convert ZIP-level hospitalizations to county-level hospitalizations.
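The conversion described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the pipeline's code: the hospitalization counts are made up, the ZIP-to-county pairs other than 84712 are unverified placeholders, and the function name aggregate_to_county is hypothetical. It assumes a one-to-one crosswalk for a single year.

```python
from collections import defaultdict

# Made-up ZIP-level hospitalization counts for a single year.
zip_counts = {"84712": 12, "84713": 5, "84714": 8}

# One-to-one crosswalk for the same year: ZIP code -> county FIPS code.
# Only the 84712 pair is taken from this README; the others are placeholders.
zip_to_county = {"84712": "49017", "84713": "49001", "84714": "49001"}


def aggregate_to_county(zip_counts, zip_to_county):
    """Sum ZIP-level counts into county-level counts using a one-to-one crosswalk."""
    county_counts = defaultdict(int)
    unmatched = []
    for zip_code, count in zip_counts.items():
        county = zip_to_county.get(zip_code)
        if county is None:
            unmatched.append(zip_code)  # ZIP codes with no county match
            continue
        county_counts[county] += count
    return dict(county_counts), unmatched


county_counts, unmatched = aggregate_to_county(zip_counts, zip_to_county)
print(county_counts)
print(unmatched)
```

Returning the unmatched ZIP codes separately makes it easy to check how many records fall outside the crosswalk before trusting the county totals.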

Both ZIP code and county boundaries, like many government-established geographic structures, are dynamic and change over time. While counties generally remain consistent, they can occasionally be subject to changes such as boundary adjustments, renaming, or splitting into new counties, usually due to administrative decisions or legislative actions. ZIP codes, managed by the U.S. Postal Service, are also subject to change.

This pipeline uses crosswalks from the U.S. Department of Housing and Urban Development (HUD), maintained on a quarterly basis, to facilitate these translations. Only Q4 crosswalks are used to construct the master crosswalk in this pipeline, though intermediate quarterly crosswalks are also downloaded. Differences between quarterly crosswalks within a year (e.g., Q3 2020 and Q4 2020) are typically minor. A brief overview of these differences is provided in notes/notes.Rmd.

Parameter adjustment

In its default form, xwalk_method = "one2one", this crosswalk pipeline outputs the single "best" matching county for every ZIP code for each year from 2010 to 2023.

zip	county	year	tot_ratio
84712	49017	2016	1.0000000
84712	49031	2017	0.6666667
84712	49031	2018	0.6666667
84712	49031	2019	0.6666667
84712	49031	2020	0.6666667
84712	49017	2021	0.8928571

Matches are determined by finding the county that contains the largest share of a ZIP code's addresses, as measured by tot_ratio. The configuration parameter criteria can instead be set to bus_ratio, res_ratio, or oth_ratio, which measure business addresses, residential addresses, and other addresses, respectively. These alternative criteria may yield fewer ZIP code matches, depending on the area and years considered.
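The selection rule can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function name best_match is hypothetical, the rows are taken from the ZIP code 84712 example in this README, and ties are resolved by keeping the first row seen, which is an assumption.

```python
# Rows for ZIP code 84712 as shown in this README:
# (zip, county, year, tot_ratio)
rows = [
    ("84712", "49017", 2016, 1.0000000),
    ("84712", "49031", 2017, 0.6666667),
    ("84712", "49017", 2017, 0.3333333),
    ("84712", "49031", 2018, 0.6666667),
    ("84712", "49017", 2018, 0.3333333),
]


def best_match(rows):
    """For each (zip, year), keep the county with the largest ratio.

    Ties keep the first row encountered (an assumption for this sketch).
    """
    best = {}
    for zip_code, county, year, ratio in rows:
        key = (zip_code, year)
        if key not in best or ratio > best[key][1]:
            best[key] = (county, ratio)
    return best


for (zip_code, year), (county, ratio) in sorted(best_match(rows).items()):
    print(zip_code, county, year, ratio)
```

Swapping the ratio column for bus_ratio, res_ratio, or oth_ratio values would apply the same rule under a different criterion.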

While the majority of ZIP codes fall neatly into a single county, a significant fraction (roughly 15%) have at least 10% of their addresses in a second, non-primary county. Certain types of analyses, especially those dealing with count data, may ignore this nuance, but it may be important to keep in mind for other kinds of research. This pipeline can also return a more detailed breakdown of all the counties that share at least some addresses with a given ZIP code. We call this xwalk_method = "one2few".

zip	county	year	tot_ratio	top_match
84712	49017	2016	1.0000000	True
84712	49031	2017	0.6666667	True
84712	49017	2017	0.3333333	False
84712	49031	2018	0.6666667	True
84712	49017	2018	0.3333333	False
84712	49031	2019	0.6666667	True

In this case, the column top_match indicates whether the county in that row is the highest-ranking match for the given ZIP code in that specific year. Other options for crosswalk output are one2one_summy and one2few_summy, which simplify the output data frame by summarizing it across years. The following is example output from the pipeline with xwalk_method=one2one_summy:

zip	county	min_year	max_year	tot_ratio_avg	tot_ratio_min	tot_ratio_max
84712	49017	2010	2016	1.0000000	1.0000000	1.0000000
84712	49031	2017	2020	0.6666667	0.6666667	0.6666667
84712	49017	2021	2023	0.9036797	0.8928571	0.9090909
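One way to picture this kind of summary is to collapse consecutive years that share the same best-match county into a single row. The sketch below is an assumption about the general idea, not the pipeline's actual code; the function name summarize_runs is hypothetical, and the input rows are the yearly one2one rows for ZIP code 84712 shown earlier in this README (2016 to 2021 only, so the year ranges differ from the table above).

```python
def summarize_runs(rows):
    """Collapse yearly one-to-one rows into runs of consecutive years per county.

    `rows` is a list of (zip, county, year, tot_ratio) tuples.
    """
    summary = []
    for zip_code, county, year, ratio in sorted(rows, key=lambda r: (r[0], r[2])):
        last = summary[-1] if summary else None
        if (last and last["zip"] == zip_code and last["county"] == county
                and last["max_year"] == year - 1):
            # Same county as the previous year: extend the current run.
            last["max_year"] = year
            last["ratios"].append(ratio)
        else:
            # County changed (or first row): start a new run.
            summary.append({"zip": zip_code, "county": county,
                            "min_year": year, "max_year": year,
                            "ratios": [ratio]})
    for run in summary:
        ratios = run.pop("ratios")
        run["tot_ratio_avg"] = sum(ratios) / len(ratios)
        run["tot_ratio_min"] = min(ratios)
        run["tot_ratio_max"] = max(ratios)
    return summary


yearly = [
    ("84712", "49017", 2016, 1.0000000),
    ("84712", "49031", 2017, 0.6666667),
    ("84712", "49031", 2018, 0.6666667),
    ("84712", "49031", 2019, 0.6666667),
    ("84712", "49031", 2020, 0.6666667),
    ("84712", "49017", 2021, 0.8928571),
]
for run in summarize_runs(yearly):
    print(run)
```

Because a county can reappear after a gap, as 49017 does here, each run is kept as its own row rather than merged per county.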

The min_year and max_year parameters control the first and last years included in the crosswalk. Data are not available before 2010, and the latest available year is 2023 at the time of writing.
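The parameter constraints described in this section can be checked before invoking snakemake. The following is a hedged sketch: the helper build_snakemake_command is hypothetical, the default values are inferred from this README rather than read from config.yaml, and the upper year bound is passed in as latest_year because it may change over time.

```python
VALID_CRITERIA = {"tot_ratio", "bus_ratio", "res_ratio", "oth_ratio"}
VALID_METHODS = {"one2one", "one2few", "one2one_summy", "one2few_summy"}


def build_snakemake_command(min_year=2010, max_year=2023, criteria="tot_ratio",
                            xwalk_method="one2one", cores=1, latest_year=2023):
    """Validate the parameters and return the snakemake command as a list.

    Hypothetical helper; defaults are assumptions inferred from the README.
    """
    if not (2010 <= min_year <= max_year <= latest_year):
        raise ValueError(
            f"years must satisfy 2010 <= min_year <= max_year <= {latest_year}"
        )
    if criteria not in VALID_CRITERIA:
        raise ValueError(f"criteria must be one of {sorted(VALID_CRITERIA)}")
    if xwalk_method not in VALID_METHODS:
        raise ValueError(f"xwalk_method must be one of {sorted(VALID_METHODS)}")
    return [
        "snakemake", "--cores", str(cores), "-C",
        f"min_year={min_year}", f"max_year={max_year}",
        f"criteria={criteria}", f"xwalk_method={xwalk_method}",
    ]


print(" ".join(build_snakemake_command(min_year=2015, max_year=2020,
                                       xwalk_method="one2few")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues.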
