Voting-Rights-Code / Equitable-Polling-Locations

Optimization tool for selecting the most equitable set of polling locations (by Kolm-Pollak distance)
GNU General Public License v3.0

Equitable-Polling-Locations

The software component of this project is a tool that chooses an optimal set of polling locations from a set of potential locations. Optionally, it also gives a "best case scenario" by searching among the centroids of census block groups, which don't correspond to buildings or street corners, but are suggestive of what an ideal distribution might look like.

Unlike other optimization tools, which minimize either the mean distance traveled or the maximal distance traveled, this tool (which minimizes the Kolm-Pollak, or KP, distance) does a bit of both. For a detailed description of the methods used here, see Horton et al. For more on the Kolm-Pollak distance and why it is suitable for optimizing with equity in mind, see the following papers: Sheriff, Maguire; Logan et al.; Kolm, 1976a; Kolm, 1976b.

The result analysis folder is an illustrative example of the type of analysis that can be done with the data generated by this code. The analysis code is in R.

Example

In the following table, the first three rows have the same mean, while the last three rows have the same maximal distance traveled. The KP minimizing optimization allows the user to set an aversion to inequality (beta) parameter that defines a tradeoff between mean and standard deviation of the distances traveled. For a large enough beta, the optimization will choose the last distribution. For a smaller beta, it will choose the second row.

| Distances traveled | Mean minimizing | Max minimizing | KP minimizing |
|---|---|---|---|
| .25, .25, .25, .25, 4 | Yes | | |
| .5, .5, .5, .5, 3 | Yes | Yes | Depending on beta |
| .25, .25, .5, 1, 3 | Yes | Yes | Depending on beta |
| .5, .5, .5, .75, 3 | | Yes | |
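A quick check of the table's claims (the first three rows share a mean of 1; the last three share a maximum of 3):

```python
from statistics import mean, pstdev

# The four distributions of distances traveled from the table above
distributions = [
    [.25, .25, .25, .25, 4],
    [.5, .5, .5, .5, 3],
    [.25, .25, .5, 1, 3],
    [.5, .5, .5, .75, 3],
]

for d in distributions:
    # mean, max, and population standard deviation of each distribution
    print(mean(d), max(d), round(pstdev(d), 3))
```

The standard deviations show why a KP-style objective can distinguish rows that the mean alone cannot.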

How it works

Given a set of existing and candidate polling locations, output the most equitable (by Kolm-Pollak distance) set of polling locations. The outputs of this model can be used to measure inequity among different racial groups in terms of access to polls (measured solely in terms of distance) and investigate how changes in choices and number of polling locations would change these inequities.

The algorithm for this model is as follows:

  1. Create a list of potential polling locations
    1. Start with a list of historical polling locations
    2. Add to this a list of buildings where one could feasibly site future polling locations
    3. Combine this data with a list of "best case scenario" polling locations, modeled by census block group centroids
  2. Compute the distance from the centroid of each census block (representing residences) to each potential polling location (building or best case scenario)
    1. We average over census blocks rather than individual houses for computational feasibility
  3. Compute the Kolm-Pollak weight from each block group to each polling location
    1. KP_factor = e^(-beta * alpha * distance)
      1. beta is a user-defined parameter
      2. alpha is a data-derived normalization factor: alpha = (\sum (block_population * distance_to_closest_poll)) / (\sum (block_population * distance_to_closest_poll^2))
    2. The KP_factor plays the role of a weighted distance in a standard objective function.
      1. The exponential in the KP_factor penalizes inequality in distances traveled
      2. For instance a group of 5 people all having to travel 1 mile to a polling location would have a lower KP_factor than a situation where 4 people travel 1/2 a mile while the fifth travels 3, even though the average distance traveled in both cases is the same.
  4. Choose whether to minimize the average distance or the inequity penalized score (y_EDE) in the model
    1. Set beta = 0 for average distance
      1. In this case, minimize the average distance traveled
    2. Set beta in [-2, 0) for the inequity penalized score (y_EDE). The lower the beta, the greater penalty to inequality
      1. In this case, minimize (\sum (block_population * KP_factor)) / county_population
  5. Minimize the above according to the following constraints:
    1. Can only have a user specified number of polling locations open
    2. A user defined bound on the number of new locations
      1. Some maximal percent allowed to be new
      2. Some minimal percent that must have been a polling location in the past
      3. This can be easily modified to accommodate other needs (for example, require existing locations to remain open)
    3. Each census block can only be matched to one polling location
    4. Each census block must be matched to a single open precinct
    5. A user defined overcrowding constraint
  6. The model returns a list of matchings between census blocks and polling locations, along with the distance between the two, and a demographic breakdown of the population.
  7. The model then uses this matching and demographic data to compute a new data derived scaling factor (alpha), which it then uses to compute the inequity penalized score (y_EDE) for the matched system.
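Steps 3, 4, and 7 above can be sketched as follows. The block data here is invented for illustration, and this evaluates a single fixed matching; the real model evaluates candidate assignments inside the pyomo/SCIP optimization.

```python
import math

# Invented example: each census block's population and distance (meters)
# to its matched polling location.
blocks = [
    {"population": 120, "distance": 800.0},
    {"population": 85,  "distance": 1500.0},
    {"population": 200, "distance": 400.0},
]

beta = -1.0  # user-defined inequality-aversion parameter, in [-2, 0)

# Step 3: data-derived normalization factor alpha
alpha = (
    sum(b["population"] * b["distance"] for b in blocks)
    / sum(b["population"] * b["distance"] ** 2 for b in blocks)
)

# Step 3: Kolm-Pollak weight per block
kp_factor = [math.exp(-beta * alpha * b["distance"]) for b in blocks]

# Step 4: population-weighted objective
total_pop = sum(b["population"] for b in blocks)
objective = sum(b["population"] * kp for b, kp in zip(blocks, kp_factor)) / total_pop

# Step 7: convert the objective to the inequity-penalized score y_EDE,
# expressed in the same units as distance (meters here)
y_ede = -math.log(objective) / (beta * alpha)
```

With beta < 0 the exponential is convex, so y_ede is always at least the population-weighted mean distance; the gap grows with the spread of the distances.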

A FEW THINGS TO NOTE:

  1. Currently, this model is run on census data, which counts voting age population. We make no assumptions about eligibility to vote, either in terms of citizenship, local disqualification laws or voter registration status.
  2. When this model reports racial demographics, it uses Census categories for race and ethnicity. Namely, Ethnicity (Hispanic / Non-Hispanic) is orthogonal to race in the census data. Therefore, one may be Hispanic and Asian at the same time.
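The second note can be seen in a toy block record (all numbers invented): the ethnicity columns and the race columns are separate tabulations that each sum to the block's total population, so any race category can overlap with either ethnicity.

```python
# Invented block record illustrating that ethnicity (P4) and race (P3)
# are independent tabulations in the census data.
block = {
    "population": 100,
    "hispanic": 30, "non-hispanic": 70,  # ethnicity tabulation
    "white": 55, "black": 25, "native": 2, "asian": 10,
    "pacific_islander": 1, "other": 3, "multiple_races": 4,  # race tabulation
}

race_cols = ["white", "black", "native", "asian",
             "pacific_islander", "other", "multiple_races"]

# Both tabulations cover the same total population
assert block["hispanic"] + block["non-hispanic"] == block["population"]
assert sum(block[c] for c in race_cols) == block["population"]
```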

To install

  1. Clone main branch of Equitable-Polling-Locations
    1. This repo uses Git LFS, which can be downloaded from https://git-lfs.com/.
      1. Download the appropriate version from that website and follow the instructions included there.
      2. If those instructions don't work (as may be the case on Linux or MacOS), run sudo ./install.sh after downloading the file, then follow the instructions above. See here.
  2. Install conda if you do not have it already
    1. This program uses SCIP as an optimizer, which is easily installed using conda but not pip. (SCIP installation is completed below when the environment is created from environment.yml.)
    2. If you do not have conda installed already, use the relevant instructions here
  3. Create and activate conda environment. (Note, on a Windows machine, this requires using Anaconda Prompt.)
    1. $ conda env create -f environment.yml
    2. $ conda activate equitable-polls

To run

From command line:

First activate the environment if you have not done so already:

conda activate equitable-polls

From Google Colab:

Census Data (demographics and shapefiles):

The software requires a free Census API key to run new counties. You can apply on the Census site and be approved in seconds.

1. Create the directory authentication_documents/ 
2. Inside authentication_documents/ create a file called census_key.py
3. The file should have a single line reading: census_key = "YOUR_KEY_VALUE"

If you are only running counties already in the repo, you can use the empty string for your key (census_key = ""), but the census_key.py file must still exist locally.
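The three steps above can be scripted (a convenience sketch; the file contents are exactly as specified, with your key substituted in):

```python
from pathlib import Path

# Step 1: create the directory
auth_dir = Path("authentication_documents")
auth_dir.mkdir(exist_ok=True)

# Steps 2-3: create census_key.py with its single line.
# Use "" as the value if you only run counties already in the repo.
(auth_dir / "census_key.py").write_text('census_key = "YOUR_KEY_VALUE"\n')
```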

datasets/polling/County_ST/County_ST_locations_only.csv:

This is a manually constructed .csv file that contains data for existing and potential polling locations to be optimized against. Example file name: datasets/polling/Gwinnett_GA/Gwinnett_GA_locations_only.csv. The columns of this data set should be named and formatted as follows:

| Column Name | Definition | Example |
|---|---|---|
| Location | Name of the actual or potential polling location | 'Bethesda Senior Center' |
| Address | Street address of the actual or potential polling location (format flexible) | '788 Hillcrest Rd NW, Lilburn, GA 20047' |
| Location Type | If a polling location, must have a year when it was used; if a potential location, has a 'location type' category and the word 'Potential' (case sensitive) | 'EV_2022_2020', 'General_2020', 'Primary_2022_2020_2018', 'DropBox_2022'; 'Community Center - Potential' |
| Lat, Long | Comma-separated concatenation of latitude and longitude (can be read off of Google Maps by right-clicking on the location marker for the address) | '33.964717796407434, -83.85827288222517' |
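A minimal sketch of such a file, built from the example values in the table above (the second row is a hypothetical potential location, not real data):

```python
import csv
import io

# Two example rows: one historical polling location (from the table above)
# and one hypothetical 'Potential' location.
rows = [
    {"Location": "Bethesda Senior Center",
     "Address": "788 Hillcrest Rd NW, Lilburn, GA 20047",
     "Location Type": "EV_2022_2020",
     "Lat, Long": "33.964717796407434, -83.85827288222517"},
    {"Location": "Example Community Center",  # hypothetical
     "Address": "123 Main St, Lilburn, GA 20047",
     "Location Type": "Community Center - Potential",
     "Lat, Long": "33.96, -83.86"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Location", "Address", "Location Type", "Lat, Long"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Note that the 'Lat, Long' header contains a comma, so the csv module quotes it automatically.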

datasets/driving/County_ST/County_ST_driving_distances.csv:

OPTIONAL file for using driving distances (that have been calculated externally) in the optimization. This file will only be accessed if the optional parameter 'driving' is set to True. Example file name: datasets/driving/Gwinnett_GA/Gwinnett_GA_driving_distances.csv. The columns are as follows:

| Column Name | Definition | Example |
|---|---|---|
| id_orig | Census block id that matches the 'FIPSCODEBLOCKNUM' portion of the GEOID column from the file datasets/census/tiger/County_ST/tl_YYYY_FIPS_tabblockYY.shp | 131510703153004 |
| id_dest | Name of potential polling location, as in the Location column of the file datasets/polling/County_ST/County_ST_locations_only.csv | 'EV_2022_2020' or 'General_2020' or 'Primary_2022_2020_2018' or 'DropBox_2022' |
| distance_m | Driving distance from id_orig to id_dest in meters | 10040.72 |
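A sketch of the expected shape, using the example block id and distance from the table (the id_dest value here is the example location name from the locations file; all values are illustrative):

```python
import csv
import io

# Minimal example of a driving-distances file with the three columns above
text = (
    "id_orig,id_dest,distance_m\n"
    "131510703153004,Bethesda Senior Center,10040.72\n"
)

rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["id_orig"], rows[0]["distance_m"])
```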

CONFIG_FOLDER/County_config_DESCRIPTOR.yaml

These are the config files for the various runs.

Example path: Gwinnett_GA_configs/Gwinnett_config_full_11.yaml

Recommended convention: each config folder should vary only one parameter. For example, DeKalb_GA_no_bg_school_config should contain only (and all) runs with block groups and schools in the bad list, changing only the number of desired polling locations.

Logging

Working with code run from the command line interface

Currently the logging system in this project is overly simplistic: print statements that run only if the boolean variable "log" being passed around is set to True. The logging is intended to work from the command line as well as from instances of Jupyter notebooks. Because processes may be run concurrently, simply writing to the screen or a single log file will not work: one process may print to the screen at the same time as another. As such, all screen prints are suppressed unless multiple concurrency is disabled AND verbose mode is specified (-c0 -v on the command line).

When running from the command line, model_run.py will be called from model_run_cli.py. model_run_cli.py parses all the command line arguments and calls the function run_on_config found in model_run.py, using as many concurrent processes as the user requests via the concurrency option -c. Each concurrent call to run_on_config receives an individual instance of PollingModelConfig from model_config.py, a simple container class that passes along all the configuration needed to run pyomo/SCIP.

When multiple concurrency is selected, as discussed further in the "To run" section of this document, logs are written to the log directory specified by the user (typically ./logs) instead of the screen.

PollingModelConfig is set up with all the information needed to run a model, including where to write logs, in the variable log_file_path: a string path to the specific file that should be written (appended) to. The value of log_file_path from PollingModelConfig is passed to pyomo so that it writes its log output to the correct location. The individual log files are named after the config file being run, prefixed with a time stamp, e.g. ./logs/20231207151550_Gwinnett_config_original_2020.yaml.log.
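The naming convention can be sketched as follows (the helper name is ours, not the repo's):

```python
from datetime import datetime
from pathlib import Path

def log_path_for(config_path: str, log_dir: str = "logs") -> Path:
    """Build a timestamped log file name like
    logs/20231207151550_Gwinnett_config_original_2020.yaml.log
    following the convention described above. Hypothetical helper."""
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    return Path(log_dir) / f"{stamp}_{Path(config_path).name}.log"

print(log_path_for("./Gwinnett_GA_configs/Gwinnett_config_original_2020.yaml"))
```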

Using logs to debug

Until the logging system is updated to something more robust, any additional logging needed should be done with print statements that respect the logging boolean variable, for use when concurrency is set to single threaded. Alternatively, the file at the path log_file_path specified in the PollingModelConfig instance can be appended to.

If output from these log statements is needed, it is suggested that the command line be run in single concurrency mode with verbosity set to maximum, e.g.:

python ./model_run_cli.py -c1 -vv -l logs ./Gwinnett_GA_configs/Gwinnett_config_expanded_*.yaml

When running concurrently, logs can be followed from the log directory in realtime using something like the following in Linux/MacOS:

tail -f ./logs/20231207151550_Gwinnett_config_original_2020.yaml.log

Intermediate dataset

datasets/polling/County_ST/County_ST.csv:

This is the main data set that the optimizer uses. It includes polling locations from previous years, potential polling locations, and block group centroids, as well as distances from block centroids to the above. Example file name: datasets/polling/Gwinnett_GA/Gwinnett_GA.csv

The columns of this data set are as follows:

| Column Name | Definition | Derivation | Example / Type |
|---|---|---|---|
| id_orig | Census block code | GEOID20 from block shape file | 131350501051000 |
| id_dest | Name of the actual or potential polling location | 'Location' from County_ST_locations_only.csv | 'Bethesda Senior Center' |
| | Census block group code | GEOID20 from block group shape file | 131350501051 |
| distance_m | Distance in meters from the centroid of the block (id_orig) to id_dest | Haversine distance from (orig_lat, orig_lon) to (dest_lat, dest_lon) | FLOAT |
| county | Name of county and two-letter state abbreviation | location from the config file | 'Gwinnett_GA' |
| address | If a physical polling location, the street address; if not a potential coordinate, the name of the associated census block group | 'Address' from County_ST_locations_only.csv | '788 Hillcrest Rd NW, Lilburn, GA 20047' or NA |
| dest_lat | Latitude of the address or census block group centroid of the destination | Google Maps, or INTPTLAT20 of id_dest from block group shape file | FLOAT |
| dest_lon | Longitude of the address or census block group centroid of the destination | Google Maps, or INTPTLON20 of id_dest from block group shape file | FLOAT |
| orig_lat | Latitude of census block centroid of the origin | INTPTLAT20 of id_orig from block shape file | FLOAT |
| orig_lon | Longitude of census block centroid of the origin | INTPTLON20 of id_orig from block shape file | FLOAT |
| location_type | A description of the id_dest location | 'Location Type' from County_ST_locations_only.csv, or 'bg_centroid' | 'EV_2022_2020' or 'Library - Potential' or 'bg_centroid' |
| dest_type | A coarser description of the id_dest than that given in location_type | | 'polling' (if a previous polling location), 'potential' (if a building that is a potential polling location), or 'bg_centroid' (if a census block group centroid) |
| population | Total population of census block | 'P3_001N' of P3 data or 'P4_001N' of P4 data | INT |
| hispanic | Total hispanic population of census block | 'P4_002N' of P4 data | INT |
| non-hispanic | Total non-hispanic population of census block | 'P4_003N' of P4 data | INT |
| white | Single race white population of census block | 'P3_003N' of P3 data | INT |
| black | Single race black population of census block | 'P3_004N' of P3 data | INT |
| native | Single race native population of census block | 'P3_005N' of P3 data | INT |
| asian | Single race asian population of census block | 'P3_006N' of P3 data | INT |
| pacific_islander | Single race pacific_islander population of census block | 'P3_007N' of P3 data | INT |
| other | Single race other population of census block | 'P3_008N' of P3 data | INT |
| multiple_races | Total multi-racial population of census block | 'P3_009N' of P3 data | INT |
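The distance_m column is a haversine (great-circle) distance; a standalone sketch of that computation (the function name is ours, not the repo's):

```python
import math

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float,
                r: float = 6371000.0) -> float:
    """Great-circle distance in meters between two (lat, lon) points,
    using a mean Earth radius of 6,371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

One degree of longitude at the equator comes out to roughly 111.2 km with this radius.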

Output datasets

For each set of parameters specified in a config file (CONFIG_FOLDER/County_config_DESCRIPTOR.yaml), the program produces 4 output files.

The four files can be described as follows:

Result Analysis

TBW

Acknowledgements

Our tool uses the SCIP mixed-integer optimization solver:

SCIP Optimization Suite 8.0
Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny, Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, Marc E. Pfetsch, Daniel Rehfeldt, Steffan Schlein, Franziska Schlösser, Felipe Serrano, Yuji Shinano, Boro Sofranac, Mark Turner, Stefan Vigerske, Fabian Wegscheider, Philipp Wellner, Dieter Weninger, Jakob Witzig
Available at Optimization Online and as ZIB-Report 21-41, December 2021

How to cite

If you use the Equitable-Polling-Locations code base in your work, please cite the following:

  1. The github repository using the "Cite this repository" dropdown menu

  2. Horton, D., Logan, T., Murrell, J., Speakman, E. & Skipper, D. (2024). A Scalable Approach to Equitable Facility Location. https://doi.org/10.48550/arXiv.2401.15452