Voting-Rights-Code / Equitable-Polling-Locations

Optimization tool for selecting the most equitable set of polling locations (by Kolm-Pollak distance)
GNU General Public License v3.0

Equitable-Polling-Locations

The software component of this project is a tool that chooses an optimal set of polling locations from a set of potential locations. Optionally, it also gives a "best case scenario" by searching among the centroids of census block groups, which don't correspond to buildings or street corners, but are suggestive of what an ideal distribution might look like.

Unlike other optimization tools, which minimize either the mean distance traveled or the maximal distance traveled, this tool (which minimizes the Kolm-Pollak, or KP, distance) does a bit of both. For a detailed description of the methods used here, see Horton et al. For more on the Kolm-Pollak distance and why it is suitable for optimizing with equity in mind, see the following papers: Sheriff, Maguire; Logan et al.; Kolm, 1976a; Kolm, 1976b.

The result analysis folder is an illustrative example of the type of analysis that can be done with the data generated by this code. The analysis code is in R.

Example

In the following table, the first three rows have the same mean, while the last three rows have the same maximal distance traveled. The KP minimizing optimization allows the user to set an aversion to inequality (beta) parameter that defines a tradeoff between mean and standard deviation of the distances traveled. For a large enough beta, the optimization will choose the last distribution. For a smaller beta, it will choose the second row.

| Distances traveled | Mean minimizing | Max minimizing | KP minimizing |
|---|---|---|---|
| .25, .25, .25, .25, 4 | Yes | | |
| .5, .5, .5, .5, 3 | Yes | Yes | Depending on beta |
| .25, .25, .5, 1, 3 | Yes | Yes | Depending on beta |
| .5, .5, .5, .75, 3 | | Yes | |
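A quick check of the table's claims (the first three rows share a mean of 1; the last three share a maximum of 3):

```python
from statistics import mean, pstdev

# The four distributions of distances traveled from the table above
distributions = [
    [.25, .25, .25, .25, 4],
    [.5, .5, .5, .5, 3],
    [.25, .25, .5, 1, 3],
    [.5, .5, .5, .75, 3],
]

for d in distributions:
    # mean, max, and population standard deviation of each distribution
    print(mean(d), max(d), round(pstdev(d), 3))
```

The standard deviations show why a KP-style objective can distinguish rows that the mean alone cannot.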

How it works

Given a set of existing and candidate polling locations, output the most equitable (by Kolm-Pollak distance) set of polling locations. The outputs of this model can be used to measure inequity among different racial groups in terms of access to polls (measured solely in terms of distance) and investigate how changes in choices and number of polling locations would change these inequities.

The algorithm for this model is as follows:

  1. Create a list of potential polling locations
    1. Start with a list of historical polling locations
    2. Add to this a list of buildings where one could feasibly site future polling locations
    3. Combine this data with a list of "best case scenario" polling locations, modeled by census block group centroids
  2. Compute the distance from the centroid of each census block (representing residences) to each potential polling location (building or best case scenario)
    1. We average over census blocks rather than individual houses for computational feasibility
  3. Compute the Kolm-Pollak weight from each block group to each polling location
    1. KP_factor = e^(-beta * alpha * distance)
      1. beta is a user-defined parameter
      2. alpha is a data-derived normalization factor: alpha = (\sum (block_population * distance_to_closest_poll)) / (\sum (block_population * distance_to_closest_poll^2))
    2. The KP_factor plays the role of a weighted distance in a standard objective function.
      1. The exponential in the KP_factor penalizes inequality in distances traveled
      2. For instance a group of 5 people all having to travel 1 mile to a polling location would have a lower KP_factor than a situation where 4 people travel 1/2 a mile while the fifth travels 3, even though the average distance traveled in both cases is the same.
  4. Choose whether to minimize the average distance or the inequity penalized score (y_EDE) in the model
    1. Set beta = 0 for average distance
      1. In this case, minimize the average distance traveled
    2. Set beta in [-2, 0) for the inequity penalized score (y_EDE). The lower the beta, the greater penalty to inequality
      1. In this case, minimize (\sum (block_population * KP_factor)) / county_population
  5. Minimize the above according to the following constraints:
    1. Can only have a user specified number of polling locations open
    2. A user defined bound on the number of new locations
      1. Some maximal percent allowed to be new
      2. Some minimal percent that must have been a polling location in the past
      3. This can be easily modified to accommodate other needs (for example, require existing locations to remain open)
    3. Each census block can only be matched to one polling location
    4. Each census block must be matched to a single open precinct
    5. A user defined overcrowding constraint
  6. The model returns a list of matchings between census blocks and polling locations, along with the distance between the two, and a demographic breakdown of the population.
  7. The model then uses this matching and demographic data to compute a new data derived scaling factor (alpha), which it then uses to compute the inequity penalized score (y_EDE) for the matched system.
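Steps 3, 4, and 7 above can be sketched as follows. The block data here is invented for illustration, and this evaluates a single fixed matching; the real model evaluates candidate assignments inside the pyomo/SCIP optimization.

```python
import math

# Invented example: each census block's population and distance (meters)
# to its matched polling location.
blocks = [
    {"population": 120, "distance": 800.0},
    {"population": 85,  "distance": 1500.0},
    {"population": 200, "distance": 400.0},
]

beta = -1.0  # user-defined inequality-aversion parameter, in [-2, 0)

# Step 3: data-derived normalization factor alpha
alpha = (
    sum(b["population"] * b["distance"] for b in blocks)
    / sum(b["population"] * b["distance"] ** 2 for b in blocks)
)

# Step 3: Kolm-Pollak weight per block
kp_factor = [math.exp(-beta * alpha * b["distance"]) for b in blocks]

# Step 4: population-weighted objective
total_pop = sum(b["population"] for b in blocks)
objective = sum(b["population"] * kp for b, kp in zip(blocks, kp_factor)) / total_pop

# Step 7: convert the objective to the inequity-penalized score y_EDE,
# expressed in the same units as distance (meters here)
y_ede = -math.log(objective) / (beta * alpha)
```

With beta < 0 the exponential is convex, so y_ede is always at least the population-weighted mean distance; the gap grows with the spread of the distances.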

A FEW THINGS TO NOTE:

  1. Currently, this model is run on census data, which counts voting age population. We make no assumptions about eligibility to vote, either in terms of citizenship, local disqualification laws or voter registration status.
  2. When this model reports racial demographics, it uses Census categories for race and ethnicity. Namely, Ethnicity (Hispanic / Non-Hispanic) is orthogonal to race in the census data. Therefore, one may be Hispanic and Asian at the same time.
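The second note can be seen in a toy block record (all numbers invented): the ethnicity columns and the race columns are separate tabulations that each sum to the block's total population, so any race category can overlap with either ethnicity.

```python
# Invented block record illustrating that ethnicity (P4) and race (P3)
# are independent tabulations in the census data.
block = {
    "population": 100,
    "hispanic": 30, "non-hispanic": 70,  # ethnicity tabulation
    "white": 55, "black": 25, "native": 2, "asian": 10,
    "pacific_islander": 1, "other": 3, "multiple_races": 4,  # race tabulation
}

race_cols = ["white", "black", "native", "asian",
             "pacific_islander", "other", "multiple_races"]

# Both tabulations cover the same total population
assert block["hispanic"] + block["non-hispanic"] == block["population"]
assert sum(block[c] for c in race_cols) == block["population"]
```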

To install

  1. Clone main branch of Equitable-Polling-Locations
    1. This repo uses Git LFS, which can be downloaded from https://git-lfs.com/.
      1. Download the appropriate version from that website and follow the instructions included there.
      2. If those instructions don't work (as may be the case on Linux or MacOS), run sudo ./install.sh after downloading the file, then follow the instructions above. See here.
  2. Install conda if you do not have it already
    1. This program uses SCIP as an optimizer, which is easily installed using conda but not pip. (SCIP installation is completed below when the environment is created from environment.yml.)
    2. If you do not have conda installed already, use the relevant instructions here
  3. Create and activate conda environment. (Note, on a Windows machine, this requires using Anaconda Prompt.)
    1. $ conda env create -f environment.yml
    2. $ conda activate equitable-polls

To run

From command line:

First activate the environment if you have not done so already:

conda activate equitable-polls

From Google Colab:

Census Data (demographics and shapefiles):

The software requires a free Census API key to run new counties. You can apply on the Census site and be approved in seconds.

1. Create the directory authentication_documents/ 
2. Inside authentication_documents/ create a file called census_key.py
3. The file should have a single line reading: census_key = "YOUR_KEY_VALUE"

If you are only running counties already in the repo, you can use the empty string for your key (census_key = ""), but the census_key.py file must still exist locally.
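The three steps above can be scripted (a convenience sketch; the file contents are exactly as specified, with your key substituted in):

```python
from pathlib import Path

# Step 1: create the directory
auth_dir = Path("authentication_documents")
auth_dir.mkdir(exist_ok=True)

# Steps 2-3: create census_key.py with its single line.
# Use "" as the value if you only run counties already in the repo.
(auth_dir / "census_key.py").write_text('census_key = "YOUR_KEY_VALUE"\n')
```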

datasets/polling/County_ST/County_ST_locations_only.csv:

This is a manually constructed .csv file that contains data for existing and potential polling locations to be optimized against. Example file name: datasets/polling/Gwinnett_GA/Gwinnett_GA_locations_only.csv. The columns of this data set should be named and formatted as follows:

| Column Name | Definition | Example |
|---|---|---|
| Location | Name of the actual or potential polling location | 'Bethesda Senior Center' |
| Address | Street address of the actual or potential polling location (format flexible) | '788 Hillcrest Rd NW, Lilburn, GA 20047' |
| Location Type | If a polling location, must have a year when it was used; if a potential location, has a 'location type' category and the word 'Potential' (case sensitive) | 'EV_2022_2020', 'General_2020', 'Primary_2022_2020_2018', 'DropBox_2022'; 'Community Center - Potential' |
| Lat, Long | Comma-separated concatenation of latitude and longitude (can be read off of Google Maps by right-clicking on the location marker for the address) | '33.964717796407434, -83.85827288222517' |
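A minimal sketch of such a file, built from the example values in the table above (the second row is a hypothetical potential location, not real data):

```python
import csv
import io

# Two example rows: one historical polling location (from the table above)
# and one hypothetical 'Potential' location.
rows = [
    {"Location": "Bethesda Senior Center",
     "Address": "788 Hillcrest Rd NW, Lilburn, GA 20047",
     "Location Type": "EV_2022_2020",
     "Lat, Long": "33.964717796407434, -83.85827288222517"},
    {"Location": "Example Community Center",  # hypothetical
     "Address": "123 Main St, Lilburn, GA 20047",
     "Location Type": "Community Center - Potential",
     "Lat, Long": "33.96, -83.86"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Location", "Address", "Location Type", "Lat, Long"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Note that the 'Lat, Long' header contains a comma, so the csv module quotes it automatically.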

datasets/driving/County_ST/County_ST_driving_distances.csv:

OPTIONAL file for using driving distances (that have been calculated externally) in the optimization. This file will only be accessed if the optional parameter 'driving' is set to True. Example file name: datasets/driving/Gwinnett_GA/Gwinnett_GA_driving_distances.csv. The columns are as follows:

| Column Name | Definition | Example |
|---|---|---|
| id_orig | Census block id that matches the 'FIPSCODEBLOCKNUM' portion of the GEOID column from the file datasets/census/tiger/County_ST/tl_YYYY_FIPS_tabblockYY.shp | 131510703153004 |
| id_dest | Name of potential polling location, as in the Location column of the file datasets/polling/County_ST/County_ST_locations_only.csv | 'EV_2022_2020' or 'General_2020' or 'Primary_2022_2020_2018' or 'DropBox_2022' |
| distance_m | Driving distance from id_orig to id_dest in meters | 10040.72 |
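A sketch of the expected shape, using the example block id and distance from the table (the id_dest value here is the example location name from the locations file; all values are illustrative):

```python
import csv
import io

# Minimal example of a driving-distances file with the three columns above
text = (
    "id_orig,id_dest,distance_m\n"
    "131510703153004,Bethesda Senior Center,10040.72\n"
)

rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["id_orig"], rows[0]["distance_m"])
```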

CONFIG_FOLDER/County_config_DESCRIPTOR.yaml

These are the config files for the various runs.

Example path: Gwinnett_GA_configs/Gwinnett_config_full_11.yaml

Recommended convention: each config folder should vary only one parameter. For example, DeKalb_GA_no_bg_school_config should contain only (and all) runs with block groups and schools in the bad list, changing only the number of desired polling locations.

Logging

Working with code run from the command line interface

Currently the logging system in this project is overly simplistic: print statements that run only if the boolean variable "log" being passed around is set to True. The logging is intended to work from the command line as well as from instances of Jupyter notebooks. Because processes may be run concurrently, simply writing to the screen or a single log file will not work: one process may print to the screen at the same time as another. As such, all screen prints are suppressed unless multiple concurrency is disabled AND verbose mode is specified (-c0 -v on the command line).

When running from the command line, model_run.py will be called from model_run_cli.py. model_run_cli.py parses all the command line arguments and calls the function run_on_config found in model_run.py, using as many concurrent processes as the user requests via the concurrency option -c. Each concurrent call to run_on_config receives an individual instance of PollingModelConfig from model_config.py, a simple container class that passes along all the configuration needed to run pyomo/SCIP.

When multiple concurrency is selected, as discussed further in the "To run" section of this document, logs are written to the log directory specified by the user (typically ./logs) instead of the screen.

PollingModelConfig is set up with all the information needed to run a model, including where to write logs, in the variable log_file_path: a string path to the specific file that should be written (appended) to. The value of log_file_path from PollingModelConfig is passed to pyomo so that it writes its log output to the correct location. The individual log files are named after the config file being run, prefixed with a time stamp, e.g. ./logs/20231207151550_Gwinnett_config_original_2020.yaml.log.
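The naming convention can be sketched as follows (the helper name is ours, not the repo's):

```python
from datetime import datetime
from pathlib import Path

def log_path_for(config_path: str, log_dir: str = "logs") -> Path:
    """Build a timestamped log file name like
    logs/20231207151550_Gwinnett_config_original_2020.yaml.log
    following the convention described above. Hypothetical helper."""
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return Path(log_dir) / f"{stamp}_{Path(config_path).name}.log"

print(log_path_for("./Gwinnett_GA_configs/Gwinnett_config_original_2020.yaml"))
```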

Using logs to debug

Until the logging system is updated to something more robust, any additional logging needed should be done with print statements that respect the logging boolean variable, for use when concurrency is set to single threaded. Alternatively, the file at the path log_file_path specified in the PollingModelConfig instance can be appended to.

If output from these log statements is needed, it is suggested that the command line be run in single concurrency mode with verbosity set to maximum, e.g.:

python ./model_run_cli.py -c1 -vv -l logs ./Gwinnett_GA_configs/Gwinnett_config_expanded_*.yaml

When running concurrently, logs can be followed from the log directory in realtime using something like the following in Linux/MacOS:

tail -f ./logs/20231207151550_Gwinnett_config_original_2020.yaml.log

Intermediate dataset

datasets/polling/County_ST/County_ST.csv:

This is the main data set that the optimizer uses. It includes polling locations from previous years, potential polling locations, and block group centroids, as well as distances from block centroids to the above. Example file name: datasets/polling/Gwinnett_GA/Gwinnett_GA.csv

The columns of this data set are as follows:

| Column Name | Definition | Derivation | Example / Type |
|---|---|---|---|
| id_orig | Census block code | GEOID20 from block shape file | 131350501051000 |
| id_dest | Name of the actual or potential polling location | 'Location' from County_ST_locations_only.csv | 'Bethesda Senior Center' |
| | Census block group code | GEOID20 from block group shape file | 131350501051 |
| distance_m | Distance in meters from the centroid of the block (id_orig) to id_dest | Haversine distance from (orig_lat, orig_lon) to (dest_lat, dest_lon) | FLOAT |
| county | Name of county and two-letter state abbreviation | location from the config file | 'Gwinnett_GA' |
| address | If a physical polling location, the street address; if not a potential coordinate, the name of the associated census block group | 'Address' from County_ST_locations_only.csv | '788 Hillcrest Rd NW, Lilburn, GA 20047' or NA |
| dest_lat | Latitude of the address or census block group centroid of the destination | Google Maps, or INTPTLAT20 of id_dest from block group shape file | FLOAT |
| dest_lon | Longitude of the address or census block group centroid of the destination | Google Maps, or INTPTLON20 of id_dest from block group shape file | FLOAT |
| orig_lat | Latitude of census block centroid of the origin | INTPTLAT20 of id_orig from block shape file | FLOAT |
| orig_lon | Longitude of census block centroid of the origin | INTPTLON20 of id_orig from block shape file | FLOAT |
| location_type | A description of the id_dest location | 'Location Type' from County_ST_locations_only.csv, or 'bg_centroid' | 'EV_2022_2020' or 'Library - Potential' or 'bg_centroid' |
| dest_type | A coarser description of the id_dest than that given in location_type | | 'polling' (if a previous polling location), 'potential' (if a building that is a potential polling location), or 'bg_centroid' (if a census block group centroid) |
| population | Total population of census block | 'P3_001N' of P3 data or 'P4_001N' of P4 data | INT |
| hispanic | Total hispanic population of census block | 'P4_002N' of P4 data | INT |
| non-hispanic | Total non-hispanic population of census block | 'P4_003N' of P4 data | INT |
| white | Single race white population of census block | 'P3_003N' of P3 data | INT |
| black | Single race black population of census block | 'P3_004N' of P3 data | INT |
| native | Single race native population of census block | 'P3_005N' of P3 data | INT |
| asian | Single race asian population of census block | 'P3_006N' of P3 data | INT |
| pacific_islander | Single race pacific_islander population of census block | 'P3_007N' of P3 data | INT |
| other | Single race other population of census block | 'P3_008N' of P3 data | INT |
| multiple_races | Total multi-racial population of census block | 'P3_009N' of P3 data | INT |
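The distance_m column is a haversine (great-circle) distance; a standalone sketch of that computation (the function name is ours, not the repo's):

```python
import math

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float,
                r: float = 6371000.0) -> float:
    """Great-circle distance in meters between two (lat, lon) points,
    using a mean Earth radius of 6,371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

One degree of longitude at the equator comes out to roughly 111.2 km with this radius.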

Output datasets

For each set of parameters specified in a config file (CONFIG_FOLDER/County_config_DESCRIPTOR.yaml), the program produces 4 output files.

The four files can be described as follows:

Result Analysis

TBW

Acknowledgements

Our tool uses the SCIP mixed-integer optimization solver:

SCIP Optimization Suite 8.0
Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny, Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, Marc E. Pfetsch, Daniel Rehfeldt, Steffan Schlein, Franziska Schlösser, Felipe Serrano, Yuji Shinano, Boro Sofranac, Mark Turner, Stefan Vigerske, Fabian Wegscheider, Philipp Wellner, Dieter Weninger, Jakob Witzig
Available at Optimization Online and as ZIB-Report 21-41, December 2021

How to cite

If you use the Equitable-Polling-Locations code base in your work, please cite the following:

  1. The github repository using the "Cite this repository" dropdown menu

  2. Horton, D., Logan, T., Murrell, J., Speakman, E. & Skipper, D. (2024). A Scalable Approach to Equitable Facility Location. https://doi.org/10.48550/arXiv.2401.15452