This repository contains code to construct 26 metrics for 25 predictors of mobility across 5 pillars that broadly measure mobility from poverty. The data are available for 3,143 counties (example: Los Angeles County) and 486 selected cities (example: Philadelphia).
To learn more about the upward mobility framework, please read:
To learn more about the data, please read:
The objective of this repository is to make all results reproducible, to document processes and assumptions, and to make it easier for analysts to produce metrics in future years. A little extra effort today can make a big difference in the future. For more motivation, please read the motivation for a style guide by Michael Stepner. If that isn't enough, read the section on technical debt.
This guide is a work-in-progress. If there are any ambiguities or unresolved questions, please contact Aaron R. Williams.
Note: The code is organized by nine domains for legacy reasons even though the updated framework is organized into five pillars. Below is a table showing each predictor by pillar, and the domain it was previously assigned to.
Legacy Domain | Pillar | Predictors |
---|---|---|
01_financial-well-being | Rewarding Work | Opportunities for income; Financial security |
02_housing | Opportunity-Rich & Inclusive Neighborhoods | Wealth-building opportunities; Housing affordability; Housing stability |
03_health | Healthy Environment & Access to Good Health Care | Access to health services; Neonatal health; Safety from trauma |
05_local-governments | Responsive & Just Governance | Political participation; Descriptive representation |
06_neighborhoods | Opportunity-Rich & Inclusive Neighborhoods | Economic inclusion; Racial diversity; Transportation access; Environmental quality; Social capital |
07_safety | Responsive & Just Governance | Safety from crime; Just policing |
08_education | High-Quality Education | Access to preschool; Effective public education; School economic diversity; Preparation for college; Digital access |
09_employment | Rewarding Work | Employment opportunities; Access to jobs paying a living wage |
The all-metrics combined datasets for this project are written out as several files, which are described below. The files differ mainly in the geographic level of the data (city vs. county), the number of years included, and whether subgroups (e.g. race/ethnicity) are included. The combined files are in a "long" format rather than a "wide" format, meaning that in files covering multiple years or subgroups each unique geography accounts for more than one row. The data are hosted publicly on the Urban Institute data catalog.
The recent county file has exactly one row per county and contains the most recent year for each of the mobility metrics. This file should have exactly 3,143 observations and contain missing values where metrics were unavailable, suppressed, or not computed.
state | county | state_name | county_name | Var1... |
---|---|---|---|---|
01 | 001 | "Alabama" | "Autauga County" | |
01 | 003 | "Alabama" | "Baldwin County" | |
01 | 005 | "Alabama" | "Barbour County" |
The recent city file has one row per census place and contains the most recent year for each of the mobility metrics. This file should have exactly 486 observations and contain missing values where metrics were unavailable, suppressed, or not computed. Cities are defined as census places that have a population of 75,000 or greater.
The multi-year county file contains one observation per county per year. It contains missing values where metrics are unavailable, suppressed, or have not been computed. This file has 3,142 observations per year before 2020 and 3,143 observations per year from 2020 through the most recent year.
year | state | county | state_name | county_name | Var1... |
---|---|---|---|---|---|
2014 | 01 | 001 | "Alabama" | "Autauga County" | |
2014 | 01 | 003 | "Alabama" | "Baldwin County" | |
2014 | 01 | 005 | "Alabama" | "Barbour County" |
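
The row counts stated for the multi-year county file can be checked mechanically. The following is a minimal Python sketch rather than code from this repository: the function names and the in-memory list of rows are hypothetical, and the same check could be written with `stopifnot()` in R or `assert` in Stata.

```python
from collections import Counter

# Expected county counts, taken from the file description above.
COUNTIES_BEFORE_2020 = 3142
COUNTIES_FROM_2020 = 3143


def expected_counties(year: int) -> int:
    """Return the number of county rows expected for a given year."""
    return COUNTIES_BEFORE_2020 if year < 2020 else COUNTIES_FROM_2020


def check_county_rows_per_year(rows: list[dict]) -> dict[int, int]:
    """Count rows per year and fail loudly if a year has an unexpected count.

    Each row is a dict with at least the keys "year", "state", and "county".
    """
    counts = Counter(int(row["year"]) for row in rows)
    for year, n in sorted(counts.items()):
        if n != expected_counties(year):
            raise ValueError(
                f"{year}: expected {expected_counties(year)} rows, found {n}"
            )
    # A county should appear at most once per year.
    keys = [(row["year"], row["state"], row["county"]) for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate year-state-county rows found")
    return dict(counts)
```

Raising an error instead of printing a warning stops a script before an incomplete file is saved.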
The multi-year city file contains one observation per large city per year. It contains missing values where metrics are unavailable, suppressed, or have not been computed. This file has 486 observations per year.
year | state | county | state_name | county_name | subgroup_type | subgroup |
---|---|---|---|---|---|---|
2014 | 01 | 001 | "Alabama" | "Autauga County" | "all" | "All" |
2014 | 01 | 001 | "Alabama" | "Autauga County" | "race-ethnicity" | "Black, Non-Hispanic" |
2014 | 01 | 001 | "Alabama" | "Autauga County" | "race-ethnicity" | "Hispanic" |
2014 | 01 | 001 | "Alabama" | "Autauga County" | "race-ethnicity" | "Other Races and Ethnicities" |
2014 | 01 | 001 | "Alabama" | "Autauga County" | "race-ethnicity" | "White, Non-Hispanic" |
- Use an `.Rproj`. If using Stata, use projects. Otherwise, set the working directory. This ensures that the code is portable.
- Use a `data/` folder for intermediate data files. The `data/` folder should be added to the `.gitignore`. The final metric data should be added to GitHub.

- Do not work directly on the `main` branch. This project uses a staging branch called `version2024` that all updates should work through as if it were the main branch. All updates should be pushed to this branch.
- Use a `data/` folder for intermediate data files. The `data/` folder should be added to the `.gitignore`. The final metric files should be added to GitHub.
- Regularly pull from the `version2024` branch to keep your local and remote branches up-to-date. Most merges will automatically resolve. Here are tips for resolving other merge conflicts.

An Urban Institute-focused introduction to GitHub including installation instructions is available here.
After installing Git and setting up a GitHub account, follow these steps to get started on Windows:

1. Submit `git clone https://github.com/UI-Research/mobility-from-poverty.git`. You will need to enter your user email and password. Everything will then copy to your computer.
2. Navigate to the `mobility-from-poverty` folder that the clone creates, right click, and select "Git Bash Here".
3. Submit `git checkout version2024` to get to the staging branch.
4. Submit `git checkout -b <"issue name">` to create a branch for your work, but replace `"issue name"` with the issue you are working on.
5. To return to that branch later, submit `git checkout <"issue name">`.

After this, you should be able to edit files and then add them to Git with the process outlined in the guide above.
GitHub will be used as the primary form of communication for programs and data. The workflow will rely on GitHub Issues that will be linked to metrics work goals. These issues will be organized and tracked using GitHub projects which can be viewed on the GitHub repository.
Note: The GitHub repository is public and all files that are not included in the gitignore will be publicly available when pushed to the repository.
- Switch to the `version2024` branch of `mobility-from-poverty` and ensure it is up to date with GitHub:
    - `git checkout version2024`
    - `git pull origin version2024`
- Create a branch for your issue with `git checkout -b <"issue name">`.
- Add your changes to the code.
- The command `git status` shows which files have changed. `git diff <"filename">` will highlight which lines have been modified. Use the arrow keys to scroll, and press `q` if you need to exit the prompt.
- `git add [filename]` will stage files to commit (`git add -u` will stage all modified files that Git already tracks).
- `git commit -m <"your message here">` will commit changes to version control. Commit messages should be clear and meaningful.
- `git push origin <"issue branch name">` will push committed changes up to GitHub for review.
- When your changes are ready to move toward the `main` branch, put in a Pull Request. Tag your assigned reviewer (@reviewer). Briefly describe what the PR does.
- Once the Pull Request is approved, your changes will be merged into the `version2024` branch. Reviewers may ask you to make changes. For Urban employees only, please reach out to the "umf-mobilitymetrics3" slack channel if you have questions.

Issues exist for each metric update that needs to be completed. Metric leads will be assigned to their issues using GitHub (assignments will be linked to GitHub accounts).
Note that issues will include notes on what needs to be completed outside of updating to the latest data. There are two types of recurring notes that appear on most issues:
This section walks through the data standards: the raw data used to create each metric, the joining variables that must be included in every file, data naming and sorting conventions, data quality standards and standard errors, subgroup files, and file naming and final metric file standards.
The first three variables in every file should be `year`, `state`, and `county`/`place`.

- `year` should be a four digit numeric variable.
- `state` should be a two character FIPS code.
- `county` should be a three character FIPS code.
- `place` should be the 5-digit census place FIPS code.

Intermediate files at the tract-level should include `tract` as the fourth variable. `tract` should be a six character FIPS code.

All geography variables should have leading zeros for ids beginning in zeros.
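
The widths and leading zeros described above are easy to lose when identifiers pass through numeric types. The following is a minimal Python sketch, not code from this repository, that pads and validates the geography variables; the names `FIPS_WIDTHS`, `pad_fips`, and `is_valid_year` are hypothetical.

```python
# Required width of each geography identifier, per the standards above.
FIPS_WIDTHS = {"state": 2, "county": 3, "place": 5, "tract": 6}


def pad_fips(value, variable: str) -> str:
    """Return a FIPS code as a zero-padded string of the required width."""
    width = FIPS_WIDTHS[variable]
    text = str(value).strip()
    # Reject anything that is not plain digits or that is already too long.
    if not (text.isascii() and text.isdigit()) or len(text) > width:
        raise ValueError(
            f"{variable}={value!r} is not a valid {width}-character FIPS code"
        )
    return text.zfill(width)


def is_valid_year(value) -> bool:
    """Check that year is a four digit number rather than a string."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and 1000 <= value <= 9999
    )
```

For example, `pad_fips(1, "state")` returns `"01"` and `pad_fips("5", "county")` returns `"005"`, matching the Alabama rows shown earlier.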
The all-metrics combined subgroup datasets will contain a subset of metrics from the original/years dataset because not all metrics will be extended for subgroup analysis.

The only variables in the subgroup datasets that will not be in the aggregate datasets will be `subgroup_type` and `subgroup`.

- `subgroup_type` will be the broader category of the variable used to break out the data, for example `race-ethnicity`.
- `subgroup` will be the name of the specific subgroup. These may differ some across metrics, so we will need to converge on the appropriate names. The table below shows the current list of subgroup types and subgroup values. If your metric has subgroup data, the values should match the names in the table below.
subgroup category | subgroup_type (variable name) | subgroup |
---|---|---|
Race and ethnicity | race-ethnicity | All; Black, Non-Hispanic; Hispanic; Other Races and Ethnicities; White, Non-Hispanic |
Race | race-ethnicity | All; Black; Hispanic; Other Races and Ethnicities; White |
Race share | race-share | All; Majority Non-White; Majority White, Non-Hispanic; Mixed Race and Ethnicity |
Income | income | All; Low Income; Not Low-Income |
If you are an Urban employee and believe that the values of the subgroup do not align with the table above, please reach out to the umf-mobilitymetrics3 slack channel for guidance.
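
One way to catch subgroup names that drift from the table is to compare them against the allowed lists before saving a file. The following is a minimal Python sketch, not code from this repository; `ALLOWED_SUBGROUPS` and `unexpected_subgroups` are hypothetical names, and the `"all"` entry is an assumption based on the example subgroup rows shown earlier, where the "All" row carries the `subgroup_type` value `"all"`.

```python
# Allowed subgroup values, copied from the table above. "race-ethnicity" has
# two naming schemes, so it maps to two allowed sets.
ALLOWED_SUBGROUPS = {
    "all": [{"All"}],
    "race-ethnicity": [
        {"All", "Black, Non-Hispanic", "Hispanic",
         "Other Races and Ethnicities", "White, Non-Hispanic"},
        {"All", "Black", "Hispanic", "Other Races and Ethnicities", "White"},
    ],
    "race-share": [
        {"All", "Majority Non-White", "Majority White, Non-Hispanic",
         "Mixed Race and Ethnicity"},
    ],
    "income": [{"All", "Low Income", "Not Low-Income"}],
}


def unexpected_subgroups(subgroup_type: str, observed) -> list[str]:
    """Return observed values that do not fit the closest allowed scheme.

    An empty list means every value belongs to a single allowed scheme.
    """
    if subgroup_type not in ALLOWED_SUBGROUPS:
        raise ValueError(f"unknown subgroup_type: {subgroup_type!r}")
    observed = set(observed)
    # Compare against each scheme and report the smallest set of mismatches.
    mismatches = [observed - allowed for allowed in ALLOWED_SUBGROUPS[subgroup_type]]
    return sorted(min(mismatches, key=len))
```

Comparison is exact, so differences in capitalization or hyphenation (for example `"low income"` instead of `"Low Income"`) are reported as mismatches.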
In addition to the prescribed variable names (`year`, `state`, `county`, `place`, `subgroup_type`, and `subgroup`), each dataset will also have one or more variables specific to the metric.

In previous rounds of this work, we renamed variables for metrics when building the database so the names are consistent and descriptive. All names start with one of the following prefixes:

- `share_`: For example, the variable showing the share with debt in collections is titled `share_debt_col`
- `pctl_`: For example, the variable showing the 20th percentile of income is titled `pctl_income_20`
- `rate_`: For example, the variable showing the reported violent crimes per 100,000 people is titled `rate_violent_crime`
- `count_`: For example, the variable showing the number of public-school children who are ever homeless during the school year is titled `count_homeless`
- `index_`: For example, the variable showing the air quality index is titled `index_air_quality`

Moving forward, please use these standardized variable names in the program for each of your assigned metrics. Variable names should only include lower case letters, numbers, and underscores (snake case, e.g. `snake_case`).
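
The naming rules above can be expressed as a short check. The following is a minimal Python sketch, not code from this repository; `METRIC_PREFIXES` and `is_valid_metric_name` are hypothetical names.

```python
import re

# Prefixes listed in the naming standard above.
METRIC_PREFIXES = ("share_", "pctl_", "rate_", "count_", "index_")

# Only lower case letters, numbers, and underscores are allowed.
ALLOWED_CHARACTERS = re.compile(r"[a-z0-9_]+")


def is_valid_metric_name(name: str) -> bool:
    """Return True if a metric variable name follows the naming standard."""
    return (
        name.startswith(METRIC_PREFIXES)
        and ALLOWED_CHARACTERS.fullmatch(name) is not None
    )
```

The example names from the list, such as `share_debt_col` and `pctl_income_20`, pass this check, while names with capital letters, spaces, or a missing prefix do not.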
Values for subgroups will depend on data availability and prioritization.
For race, the objective is to pull "Black, Non-Hispanic", "Hispanic", "Other Races and Ethnicities", and "White, Non-Hispanic." If a subgroup lacks the precision to be responsibly reported, then report an `NA` and set the data quality to `NA`.
Try to not combine groups such as "Other Races and Ethnicities" with "White, Non-Hispanic".
- All files should be sorted by `year`, `state`, and `county`/`place`, the first three variables in every file. Files at different geographic levels should be sorted by `year` and then in order by largest geographic level (e.g. state) to smallest geographic level (e.g. Census block).
- Subgroup files should be sorted by `year`, `state`, `county`/`place`, `subgroup_type`, and `subgroup`. All sorting should be alphanumeric. Importantly, the race/ethnicity groups should be sorted alphabetically so that "Black, Non-Hispanic" appears first and "White, Non-Hispanic" appears last.
- Join metric data onto the crosswalk (if using `left_join()` in R, the crosswalk should be the `x` argument). This ensures that the geographies included in the data are consistent across metrics (it is okay if your metric data is missing certain geographies).

Subgroup metrics are stored in a separate database with one observation per subgroup per county per year, so that metric values for subgroups are rows. This database will be in a long format and contain the "all" group. For example, if there are four subgroups then there should be 3,143 x 4 + 3,143 x 1 = 15,715 observations per year. This may seem foreign to some Stata and SAS programmers but it has several advantages.
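
The sorting, joining, and row-count rules above can be sketched in a few lines. The following Python example is illustrative only and is not code from this repository: the function names are hypothetical, rows are plain dictionaries, and the crosswalk is assumed to be a list of rows holding the key columns.

```python
def sort_key(row: dict, geography: str = "county") -> tuple:
    """Sort key for subgroup files: year, state, county/place, subgroup_type, subgroup."""
    return (row["year"], row["state"], row[geography],
            row["subgroup_type"], row["subgroup"])


def left_join_to_crosswalk(crosswalk, metric, keys=("year", "state", "county")):
    """Keep every crosswalk row and attach metric values where they exist.

    Geographies missing from the metric data get None for the metric columns.
    """
    lookup = {tuple(row[k] for k in keys): row for row in metric}
    metric_columns = [c for c in (metric[0] if metric else {}) if c not in keys]
    joined = []
    for geo in crosswalk:
        match = lookup.get(tuple(geo[k] for k in keys), {})
        joined.append({**geo, **{c: match.get(c) for c in metric_columns}})
    return joined


def expected_subgroup_rows(n_geographies: int, n_subgroups: int) -> int:
    """Rows per year in a long subgroup file, including the "all" group."""
    return n_geographies * (n_subgroups + 1)
```

With 3,143 counties and four subgroups, `expected_subgroup_rows(3143, 4)` reproduces the 15,715 observations per year given above. Putting the crosswalk on the left side of the join means the output always has one row per crosswalk geography, whatever the metric data covers.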
Each metric should have a data quality variable with the suffix `_quality`. For example, the variable showing the air quality index is titled `index_air_quality`.

Score | Description |
---|---|
1 | The calculated metric for the observation is high-quality and there are no substantial concerns with measurement error, missingness, sample size, or precision. |
2 | There are issues with the calculated metric for the observation but the issues are limited. It is OK for a community partner to look at the metric. |
3 | There are serious issues with the calculated metric for the observation. It is possible to calculate the metric but there are critical issues with measurement error, missingness, sample size, and/or precision. A community should not act on this information. |
NA | It was not possible to calculate a metric for the county or city. |
`3`.

`.csv` files. The variables should have the suffixes `_lb` for lower bound and `_ub` for upper bound.

`_lb` and `_ub` if a 95 percent confidence interval calculation isn't possible.

Final metric files should have descriptive names related to the metric and must only include lower case letters, numbers, and underscores (snake case, e.g. `snake_case`). Do not use spaces.

It is up to you how to name the files for your metric, but the file names need to be consistent (meaning you should refer to the metric in the file name the same way every time) and should be concise.
Save data in a folder titled "final" to keep the repository organized. When saving files, include the year, geography (county or place), and subgroup information in the file name unless the file is combined (e.g. the file contains multiple years).
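
A small helper can keep file names consistent with the guidance above. The following is a minimal Python sketch, not code from this repository: the function name `final_file_path` and the order of the name parts (metric, year, geography, subgroup) are assumptions, since the exact naming pattern is left to each metric lead.

```python
import re
from pathlib import PurePosixPath


def final_file_path(metric: str, geography: str, year=None, subgroup=None) -> PurePosixPath:
    """Build a path such as final/<metric>_<year>_<geography>.csv.

    Leave year as None for combined files that contain multiple years.
    """
    if geography not in ("county", "place"):
        raise ValueError("geography must be 'county' or 'place'")
    parts = [metric]
    if year is not None:
        parts.append(str(year))
    parts.append(geography)
    if subgroup is not None:
        parts.append(subgroup)
    stem = "_".join(parts)
    # Only lower case letters, numbers, and underscores are allowed.
    if re.fullmatch(r"[a-z0-9_]+", stem) is None:
        raise ValueError(f"invalid file name: {stem!r}")
    return PurePosixPath("final") / f"{stem}.csv"
```

For example, `final_file_path("index_air_quality", "county", 2021)` gives `final/index_air_quality_2021_county.csv`. Subgroup labels need underscores instead of hyphens (for example `race_ethnicity`) to pass the character check.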
- Save final files in `.csv` format. The files should be delimited with a comma.
- Do not open `.csv` files in Microsoft Excel. Excel defaults lead to analytic errors.

The tidyverse style guide was written for R but contains lots of good language-agnostic suggestions for programming.
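
For the `.csv` guidance above, reading and writing files programmatically avoids spreadsheet defaults, which can for example reinterpret a code such as `01` as the number 1. The following is a minimal Python sketch using only the standard library; it is not code from this repository and the function names are hypothetical.

```python
import csv
import io


def write_metric_csv(rows: list[dict], handle) -> None:
    """Write rows to a comma-delimited file with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(rows[0].keys()),
        delimiter=",",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


def read_metric_csv(handle) -> list[dict]:
    """Read a comma-delimited file.

    Every value is returned as a string, so leading zeros in FIPS codes survive.
    """
    return list(csv.DictReader(handle, delimiter=","))


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_metric_csv(
    [{"year": 2014, "state": "01", "county": "001", "share_debt_col": 0.25}],
    buffer,
)
buffer.seek(0)
rows_read = read_metric_csv(buffer)
```

After the round trip, `rows_read[0]["state"]` is still the string `"01"`. Numeric columns come back as strings too and need explicit conversion before analysis.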
The top of each script should clearly label the purpose of the script. Here is an example Stata header:
/*************************
Programmer: [your name]
Date created: [date]
Date of last revision: [date]
Ancestor Program: [Path to the program including the name of the program]
Original data: [Path of where the data live]
Description: [Overall description]
(1) [insert task description here, and then copy & paste this to indicate where that task is later in your program]
(2)
(3) [etc...]
*************************/
Scripts should be clearly organized so others can follow them.
Include comments throughout your scripts so others can follow your work and decisions. Include comments that state "why", not "what". Include comments for all assumptions.
Use descriptive names for all variables, datasets, functions, and macros. Avoid abbreviations. Use ISO 8601 dates (YYYY-MM-DD).
Write assertions and in-line tests.
Assertions, things expected to always be true about the code, should be tested in-line.
healthinequality-code offers some good background.
`assert` is useful in Stata and `stopifnot()` is useful in R.
Write tests for final files. For example, write a test if all numbers should be non-negative or if values should not exceed $3,000.
Check that the value ranges make sense. Spot-check your outliers to confirm that those values are not errors. In some cases, you might need to search the internet for corroboration, for example to verify that a community really has the worst air quality or to find context that explains a spike in homelessness among students. If you cannot find any, check your code to make sure it is doing what you think it is doing.
Also, check the data quality flag. Look at the distribution of assigned quality.
Write tests for macros and functions to ensure appropriate behavior.
Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead. --- Martin Fowler
Check final calculations against state and/or national numbers if available.
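
The tests described above can be collected into one function that runs against a final file. The following is a minimal Python sketch, not code from this repository: `check_final_file` is a hypothetical name, `None` stands in for `NA`, and the rule that a missing metric must have a missing quality flag is an assumption drawn from the subgroup guidance earlier in this document.

```python
from collections import Counter

# Quality scores from the data quality table; None stands in for NA.
VALID_QUALITY = {1, 2, 3, None}


def check_final_file(rows, metric, minimum=0.0, maximum=None):
    """Assert basic expectations and return the distribution of quality flags."""
    quality = f"{metric}_quality"
    for row in rows:
        value = row[metric]
        assert row[quality] in VALID_QUALITY, f"unexpected quality: {row[quality]!r}"
        if value is None:
            # Assumption: a missing metric carries a missing quality flag.
            assert row[quality] is None, "missing metric has a quality score"
            continue
        assert value >= minimum, f"{metric}={value} is below {minimum}"
        assert maximum is None or value <= maximum, f"{metric}={value} exceeds {maximum}"
    # Look at this distribution before saving the file.
    return Counter(row[quality] for row in rows)
```

For a share metric, calling `check_final_file(rows, "share_debt_col", minimum=0, maximum=1)` would stop the script on any value outside the unit interval and return the counts of each quality score for review.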
For questions about code, please contact upwardmobility@urban.org and include "Code in GitHub" in the subject line.
Metric leads will need to decide whether to create new scripts/programs for extending the database (additional years or subgroup analysis) or to extend existing scripts. The optimal approach may differ based on the situation. For example, some metric leads will need to change datasets entirely (e.g. 1-year vs. 5-year ACS data) and new scripts may be most efficient and clean, while other metric leads may need to make minimal changes to an existing script.
All code and documentation will go through a review process. Code reviews will be handled through GitHub.
It is possible that changes will be requested before the completion of a code review. For example, a reviewer may send the code back to the analyst if the code isn't reproducible (i.e. doesn't run) or if the documentation is insufficient for the reviewer to follow the logic.
The scope of the review will involve the following three levels:
All scripts should run all the way through without errors. This should be the case regardless of the user/computer.
Our code review process will be handled through GitHub, which has powerful tools for code review. This page outlines the functionality.
In our workflow, every analyst will push their code to the repository on its own branch named after the issue created for that task.
The process of reconciling these different branches into one branch called `version2024` is handled through pull requests.
For example, I will put in a pull request from `issexample` to `version2024`.
At this point, a reviewer will be requested in the pull request.
Aaron and Claudia will flag the reviewers.
The code will not be merged to `version2024` until the reviewer(s) approve the pull request.
GitHub will generate a line-by-line comparison of every line that is added or removed from `issexample` to `version2024`.
Reviewers can add line-specific comments in GitHub.
Reviewers can also add overall comments before approving or requesting changes for the pull request. If additional changes are added, GitHub will highlight the specific lines that changed in response to the review, which will save the reviewer time on second or third reviews of the same code.
Once the code is approved, the branch can be merged into the `main` branch where it can be referenced and used for subsequent analyses.
Line-by-line edits and feedback should be handled by reviewers through the point-and-click interface on GitHub. Running code from a pull request will require branching.
Suppose you are reviewing code from branch `issexample2`.
You need to "fetch" the `issexample2` branch on to your local computer to run and review the code.
Steps:

1. Open Git Bash by navigating to the `mobility-from-poverty` directory, right clicking, and selecting Git Bash Here (on Windows).
2. Switch to the `version2024` branch with `git checkout version2024`.
3. Submit `git status` and ensure that you don't have any tracked changes that have not been committed.
4. Submit `git branch` to see your current branch and other available branches. You should at least see `version2024` and `main`.
5. Submit `git fetch` to get remote branches.
6. Submit `git checkout --track origin/issexample2` to switch to the `issexample2` branch. Submit `git branch` to confirm the change.

At this point, you should be able to run and review the code. Back on GitHub, you should be able to add line-by-line comments to the Pull Request if you click "Files changed" and then click the blue plus sign that appears next to the number by the line of code.
When your review is complete, click the green "Review changes" button on GitHub.
You should be able to add overall comments, approve the Pull Request, or Request changes to the Pull Request.
If you request changes, you will need to `git pull origin issexample2` after the analyst pushes the updated code to GitHub.
When you are done, you can switch back to your branch with `git checkout branch-name` where `branch-name` is the name of the branch you wish to switch to.
If you have un-committed changes, you will need to set them aside with `git stash`.
You shouldn't make substantive changes on someone else's branch.
Only after all metrics have been updated on `version2024`, reviewed, and approved will the changes be merged into the `main` branch.
The code to create the final collective files that combine all metrics is in `10_construct-database/`.
There will be two final files. The first file will be a year-county file with one row per county per year. The second file will be a county-level file with only the most recent year of data for each variable. Both files will be tidy data with each variable in its own column, each observation in its own row, and each value in its own cell.
The data dictionary is a website created with Quarto and hosted on GitHub pages.
The Quarto documents are stored in `mobility-from-poverty-documentation/`.
The folder contains its own `.Rproj` for Quarto reasons.
The website is contained in `docs/`.
Use the following steps to update the website.

1. Open `mobility-from-poverty-documentation/mobility-from-poverty-documentation.Rproj`.
2. Submit `quarto render` at the command line.

For users outside of the Urban Institute who would like to use this repository, this section offers guidance and tips.
General repository structure
The folders in this repository are broken into three main sections: Domains, Data and Documentation/Auxiliary.
Domains
Data
Documentation/Auxiliary
For more specific use cases see below.
One benefit of hosting this work on a public repository is that external users can view and download the code used to create these indicators. If you would like to download the code used to create a certain UMF data point, you can find it in one of the folders in this repository. To track down the right folder, first use the table under repository contents to match your predictor of interest (right-hand column) with that predictor's domain. The folder containing the code for that predictor will have a title similar to the domain. Enter that folder on GitHub and locate the program file with the predictor in the title.

Example: Environmental quality

The environmental quality predictor falls under the Opportunity-Rich & Inclusive Neighborhoods pillar. In the GitHub repository there is a folder titled `06_neighborhoods`, which is the corresponding folder. Inside that folder is another folder titled `environment`. That folder contains the program that creates the predictor. GitHub will give you the option to download the raw version of the file.
The origins of the raw data used to create each of these indicators should be readily available in the code associated with each predictor. There are likely two primary ways you can trace the data:
1) For predictors where the raw data is not available for direct download through an API, the source of the raw data will be noted in text at the top of the code. Follow the link or description to download this data.
2) For predictors where the data is available through an API, the code will utilize that API to pull the data directly. You can read the section of the code that interacts with the API to see what variables and specifications are used.
For Urban employees only, please reach out to the "umf-mobilitymetrics3" slack channel if you have questions. For external users please contact Aaron R. Williams with questions.