The software component of this project is a tool that chooses an optimal set of polling locations from a set of potential locations. Optionally, it also gives a "best case scenario" by searching among the centroids of census block groups, which don't correspond to buildings or street corners, but are suggestive of what an ideal distribution might look like.
Unlike other optimization tools, which minimize either the mean distance traveled or the maximal distance traveled, this tool minimizes the Kolm-Pollak (KP) distance, which does a bit of both. For a detailed description of the methods used here, see Horton et al. For more on the Kolm-Pollak distance and why it is suitable for optimizing with equity in mind, see the following papers: Sheriff, Maguire; Logan et al.; Kolm, 1976a; Kolm, 1976b.
The result analysis folder is an illustrative example of the type of analysis that can be done with the data generated by this code. The analysis code is in R.
In the following table, the first three rows have the same mean, while the last three rows have the same maximal distance traveled. The KP minimizing optimization allows the user to set an aversion to inequality (beta) parameter that defines a tradeoff between mean and standard deviation of the distances traveled. For a large enough beta, the optimization will choose the last distribution. For a smaller beta, it will choose the second row.
Distances traveled | Mean minimizing | Max minimizing | KP minimizing |
---|---|---|---|
.25, .25, .25, .25, 4 | Yes | | |
.5, .5, .5, .5, 3 | Yes | Yes | Depending on beta |
.25, .25, .5, 1, 3 | Yes | Yes | Depending on beta |
.5, .5, .5, .75, 3 | | Yes | |
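The three summary statistics in the table can be computed for each row with a few lines of Python. This is a minimal sketch, not the tool's implementation: it assumes the commonly used Kolm-Pollak form for an undesirable quantity, with a negative `kappa` expressing aversion to inequality. The function names, the value `kappa = -1.0`, and the way `kappa` relates to the tool's beta parameter are all assumptions made for illustration.

```python
import math

# The four example distance distributions from the table above.
ROWS = [
    [0.25, 0.25, 0.25, 0.25, 4],
    [0.5, 0.5, 0.5, 0.5, 3],
    [0.25, 0.25, 0.5, 1, 3],
    [0.5, 0.5, 0.5, 0.75, 3],
]


def mean(distances):
    """Arithmetic mean of the distances traveled."""
    return sum(distances) / len(distances)


def kolm_pollak(distances, kappa):
    """Kolm-Pollak equally-distributed-equivalent distance (assumed form).

    kappa < 0 penalizes long distances more heavily; kappa == 0 is treated
    as the limiting case, which is the plain mean.
    """
    if kappa == 0:
        return mean(distances)
    avg_exp = sum(math.exp(-kappa * d) for d in distances) / len(distances)
    return -(1.0 / kappa) * math.log(avg_exp)


# Print mean, max, and KP distance for each row (kappa is illustrative).
for row in ROWS:
    print(row, round(mean(row), 3), max(row), round(kolm_pollak(row, -1.0), 3))
```

With a negative `kappa`, the KP value of a row always lies between its mean and its maximum, which is why it can be read as a compromise between the two criteria.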
Given a set of existing and candidate polling locations, output the most equitable (by Kolm-Pollak distance) set of polling locations. The outputs of this model can be used to measure inequity among different racial groups in terms of access to polls (measured solely in terms of distance) and investigate how changes in choices and number of polling locations would change these inequities.
The algorithm for this model is as follows:
A FEW THINGS TO NOTE:
sudo ./install.sh
after downloading the file, then follow the instructions above. See here.
$ conda env create -f environment.yml
$ conda activate equitable-polls
From command line:
First activate the environment if you have not done so already:
conda activate equitable-polls
python ./model_run_cli.py -h
python ./model_run_cli.py -c4 -l logs ./Gwinnett_GA_configs/Gwinnett_config_expanded_*.yaml
python ./model_run_cli.py -vv -l logs ./Gwinnett_GA_configs/Gwinnett_config_full_*.yaml
python ./model_run_cli.py -l logs ./Gwinnett_GA_configs/Gwinnett_config_full_11.yaml
From Google Colab:
For example, follow the instructions in this file (to be accessed in the directory of the Equitable-Polling-Locations git repo).
There are six files needed to run this program. The current repo contains these files for Gwinnett County, GA.
There are 4 files from the census needed for each county. These are pulled from the census the first time a county is run:
There is one manually generated file for each county:
There is one config file needed as an argument to run the program:
The software requires a free census API key to run new counties. You can apply on the census site and be approved in seconds.
1. Create the directory authentication_documents/
2. Inside authentication_documents/ create a file called census_key.py
3. The file should have a single line reading: census_key = "YOUR_KEY_VALUE"
If you are only running counties already in the repo, you can use the empty string for your key (census_key = "") but the census_key.py file must still exist locally.
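The three steps above can also be scripted. The following is a small sketch using only the Python standard library; the helper name `write_census_key` is hypothetical and not part of the repo.

```python
from pathlib import Path


def write_census_key(repo_root, key=""):
    """Create authentication_documents/census_key.py under repo_root.

    An empty key is enough when only running counties already in the repo.
    """
    auth_dir = Path(repo_root) / "authentication_documents"
    auth_dir.mkdir(parents=True, exist_ok=True)
    key_file = auth_dir / "census_key.py"
    # The file holds a single line assigning the key to census_key.
    key_file.write_text(f'census_key = "{key}"\n', encoding="utf-8")
    return key_file
```

For example, `write_census_key(".", "YOUR_KEY_VALUE")` run from the top of the git repo would create the file with your key in place.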
This is a manually constructed .csv file that contains data for existing and potential polling locations to be optimized against.
Example file name: datasets/polling/Gwinnett_GA/Gwinnett_GA_locations_only.csv
The columns of this data set should be named and formatted as follows:
| Column Name | Definition | Example |
|---|---|---|
| Location | Name of the actual or potential polling location | 'Bethesda Senior Center' |
| Address | Street Address of the actual or potential polling location | (format flexible) '788 Hillcrest Rd NW, Lilburn, GA 20047' |
| Location Type | If polling location, must have a year when it was used | 'EV_2022_2020' or 'General_2020' or 'Primary_2022_2020_2018' or 'DropBox_2022' |
| | If potential location, has a 'location type' category and the word 'Potential' (case sensitive) | 'Community Center - Potential' |
| Lat, Long | Comma separated concatenation of latitude and longitude (can be read off of Google Maps by right clicking on the location marker for the address) | '33.964717796407434, -83.85827288222517' |
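Because this file is built by hand, it can help to check it before a run. The sketch below reads a locations file in the format described above, checks that the four expected columns are present, and splits the 'Lat, Long' field into two numbers. The function names are hypothetical, and the sample row simply stitches together the example values from the table; it is not taken from the real Gwinnett file.

```python
import csv
import io

REQUIRED_COLUMNS = ["Location", "Address", "Location Type", "Lat, Long"]


def parse_location_row(row):
    """Turn one CSV row (a dict) into typed values."""
    lat_text, lon_text = row["Lat, Long"].split(",")
    return {
        "name": row["Location"],
        "address": row["Address"],
        "location_type": row["Location Type"],
        # 'Potential' is case sensitive, as described in the table.
        "is_potential": "Potential" in row["Location Type"],
        "lat": float(lat_text),
        "lon": float(lon_text),
    }


def read_locations(csv_text):
    """Parse CSV text and fail early if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [parse_location_row(row) for row in reader]


# Illustrative sample only; values are borrowed from the table's examples.
SAMPLE = (
    'Location,Address,Location Type,"Lat, Long"\n'
    'Bethesda Senior Center,"788 Hillcrest Rd NW, Lilburn, GA 20047",'
    'EV_2022_2020,"33.964717796407434, -83.85827288222517"\n'
)

locations = read_locations(SAMPLE)
```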
OPTIONAL file for using driving distances (that have been calculated externally) in the optimization. This file will only be accessed if the optional parameter 'driving' is set to True.
Example file name: datasets/driving/Gwinnett_GA/Gwinnett_GA_driving_distances.csv
The columns are as follows:
| Column Name | Definition | Example |
|---|---|---|
| id_orig | Census block id that matches the 'FIPSCODEBLOCKNUM' portion of the GEOID column from the file datasets/census/tiger/County_ST/tl_YYYY_FIPS_tabblockYY.shp | 131510703153004 |
| id_dest | Name of potential polling location, as in the Location column of the file datasets/polling/County_ST/County_ST_locations_only.csv | 'Bethesda Senior Center' |
| distance_m | Driving distance from id_orig to id_dest in meters | 10040.72 |
These are the config files for the various runs.
Example path: Gwinnett_GA_configs/Gwinnett_config_full_11.yaml
Recommended convention: Each config folder should only have one parameter changing. For example, DeKalb_GA_no_bg_school_config should contain only (and all) runs with block groups and schools in the bad list, changing only the number of desired polling locations.
Mandatory arguments
Optional arguments
Currently the logging system in this project is simplistic: it consists of print statements that only run if the boolean variable "log", which is passed around between functions, is set to True. The logging used in the project is intended to work from the command line as well as from instances of Jupyter notebooks. Processes may be run concurrently, so simply writing to the screen or to a single log file will not work, since one process may print at the same time as another. As such, all screen prints are suppressed unless multiple concurrency is disabled AND verbose mode is specified (-c0 -v on the command line).
When running from the command line, model_run.py is called from model_run_cli.py. model_run_cli.py parses all the command line arguments and calls the function run_on_config found in model_run.py, using as many concurrent processes as the user requests through the concurrency option -c. Each concurrent call to run_on_config receives its own instance of PollingModelConfig from model_config.py, which is a simple container class that passes all the configuration needed to run pyomo/SCIP.
When multiple concurrency is selected, as discussed further in the "To run" section of this document, logs are written to the log directory specified by the user on the command line, typically ./logs, instead of the screen.
PollingModelConfig is set up with all the information needed to run a model, including where to write logs, which is held in the variable log_file_path, a string giving the path of the specific file that should be written (appended) to. The value of log_file_path from PollingModelConfig is what is passed to pyomo so that it writes its log output to the correct location. The individual log files are named after the config file being run, prefixed with a time stamp, e.g. ./logs/20231207151550_Gwinnett_config_original_2020.yaml.log.
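The naming scheme for the individual log files can be reproduced as follows. This is a sketch only: the helper name `build_log_file_path` is hypothetical, and the time stamp layout (year, month, day, hour, minute, second) is an assumption read off the example file name rather than taken from the code.

```python
from datetime import datetime
from pathlib import Path


def build_log_file_path(log_dir, config_path, now=None):
    """Return <log_dir>/<timestamp>_<config file name>.log as a string."""
    now = now or datetime.now()
    # Assumed layout: YYYYMMDDHHMMSS, matching the 14 digits in the example.
    stamp = now.strftime("%Y%m%d%H%M%S")
    file_name = f"{stamp}_{Path(config_path).name}.log"
    return str(Path(log_dir) / file_name)
```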
Until the logging system is updated to something more robust, any additional logging should be done with print statements that respect the logging boolean variable, for use when concurrency is set to single threaded. Alternatively, the file path log_file_path specified in the PollingModelConfig instance can be appended to.
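Both options can be wrapped in one small helper. The function below is hypothetical and not part of the repo; it assumes the boolean flag is passed as `log` and that `log_file_path` is the string taken from a PollingModelConfig instance.

```python
def log_message(message, log=False, log_file_path=None):
    """Print when the log flag is set, and append to the log file if given."""
    if log:
        # Screen output; only useful when running single threaded.
        print(message)
    if log_file_path:
        # Append so that earlier output in the same file is preserved.
        with open(log_file_path, "a", encoding="utf-8") as log_file:
            log_file.write(message + "\n")
```

Appending rather than overwriting matters here because pyomo writes its own output to the same file.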
If output from these log statements is needed, then it is suggested that the command line be run in single concurrency mode with verbosity set to maximum, e.g.:
python ./model_run_cli.py -c1 -vv -l logs ./Gwinnett_GA_configs/Gwinnett_config_expanded_*.yaml
When running concurrently, logs can be followed from the log directory in real time using something like the following in Linux/MacOS:
tail -f ./logs/20231207151550_Gwinnett_config_original_2020.yaml.log
This is the main data set that the optimizer uses. It includes polling locations from previous years, potential polling locations, and block group centroids, as well as distances from block centroids to the above. Example file name: datasets/polling/Gwinnett_GA/Gwinnett_GA.csv
The columns of this data set are as follows:
| Column Name | Definition | Derivation | Example / Type |
|---|---|---|---|
| id_orig | Census block code | GEOID20 from block shape file | 131350501051000 |
| id_dest | Name of the actual or potential polling location | 'Location' from County_ST_locations_only.csv | 'Bethesda Senior Center' |
| | Census block group code | GEOID20 from block group shape file | 131350501051 |
| distance_m | Distance in meters from the centroid of the block (id_orig) to id_dest | haversine distance from (orig_lat, orig_lon) to (dest_lat, dest_lon) | FLOAT |
| county | Name of county and two letter state abbreviation | location from the config file | 'Gwinnett_GA' |
| address | If a physical polling location, street address | 'Address' from County_ST_locations_only.csv | '788 Hillcrest Rd NW, Lilburn, GA 20047' |
| | If not a potential coordinate, name of the associated census block group | NA | |
| dest_lat | Latitude of the address or census block group centroid of the destination | Google Maps or INTPTLAT20 of id_dest from block group shape file | FLOAT |
| dest_lon | Longitude of the address or census block group centroid of the destination | Google Maps or INTPTLON20 of id_dest from block group shape file | FLOAT |
| orig_lat | Latitude of census block centroid of the origin | INTPTLAT20 of id_orig from block shape file | FLOAT |
| orig_lon | Longitude of census block centroid of the origin | INTPTLON20 of id_orig from block shape file | FLOAT |
| location_type | A description of the id_dest location | 'Location Type' from County_ST_locations_only.csv or 'bg_centroid' | 'EV_2022_2020' or 'Library - Potential' or 'bg_centroid' |
| dest_type | A coarser description of the id_dest than that given in location_type | Either 'polling' (if previous polling location), 'potential' (if a building that is a potential polling location), or 'bg_centroid' (if a census block group centroid) | |
| population | Total population of census block | 'P3_001N' of P3 data or 'P4_001N' of P4 data | INT |
| hispanic | Total hispanic population of census block | 'P4_002N' of P4 data | INT |
| non-hispanic | Total non-hispanic population of census block | 'P4_003N' of P4 data | INT |
| white | Single race white population of census block | 'P3_003N' of P3 data | INT |
| black | Single race black population of census block | 'P3_004N' of P3 data | INT |
| native | Single race native population of census block | 'P3_005N' of P3 data | INT |
| asian | Single race asian population of census block | 'P3_006N' of P3 data | INT |
| pacific_islander | Single race pacific_islander population of census block | 'P3_007N' of P3 data | INT |
| other | Single race other population of census block | 'P3_008N' of P3 data | INT |
| multiple_races | Total multi-racial population of census block | 'P3_009N' of P3 data | INT |
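The distance_m column is described above as a haversine distance between the origin and destination coordinates. For reference, here is a minimal sketch of that formula. The function name and the Earth radius constant are assumptions; the repo may use a library routine or a slightly different radius, so values can differ in the last few digits.

```python
import math

# Mean Earth radius in meters (assumed value for this sketch).
EARTH_RADIUS_M = 6_371_000


def haversine_m(orig_lat, orig_lon, dest_lat, dest_lon):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1 = math.radians(orig_lat)
    phi2 = math.radians(dest_lat)
    d_phi = phi2 - phi1
    d_lambda = math.radians(dest_lon - orig_lon)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

Note that this is a straight-line ("as the crow flies") distance, which is why the optional driving distances file exists as an alternative.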
For each set of parameters specified in a config file (CONFIG_FOLDER/County_config_DESCRIPTOR.yaml), the program produces 4 output files.
If the file was run via Google Colab, the outputs are written in the folder Colab_results/County_ST_DESCRIPTOR_result
If the file was run via command line, the outputs are written in the folder Gwinnett_GA_results/
The four files can be described as follows:
TBW
Our tool uses the SCIP mixed-integer optimization solver:
SCIP Optimization Suite 8.0
Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny, Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, Marc E. Pfetsch, Daniel Rehfeldt, Steffan Schlein, Franziska Schlösser, Felipe Serrano, Yuji Shinano, Boro Sofranac, Mark Turner, Stefan Vigerske, Fabian Wegscheider, Philipp Wellner, Dieter Weninger, Jakob Witzig
Available at Optimization Online and as ZIB-Report 21-41, December 2021
If you use the Equitable-Polling-Locations code base in your work, please cite the following:
The GitHub repository, using the "Cite this repository" dropdown menu
Horton, D., Logan, T., Murrell, J., Speakman, E. & Skipper, D. (2024). A Scalable Approach to Equitable Facility Location. https://doi.org/10.48550/arXiv.2401.15452