chicago-police-violence / data

Dataset about the personnel, use of force, and complaints in the Chicago Police Department

The CPD Data Set

This repository contains data on the activities of ~35,000 police officers in the Chicago Police Department (CPD), including ~11,000 tactical response reports from 2004-2016 and ~110,000 civilian and administrative complaints from 2000-2018. The data were obtained through a series of requests made under the Freedom of Information Act (FOIA) and coordinated by the Invisible Institute.

Details about the FOIA requests, and about which CPD information each one covers, can be found in the file raw/datasets.csv. The original data, which serve as the starting point for this repository, were imported from the Invisible Institute's download page.

Requirements

Code

This code requires Python>=3.8 and GNU Make>=4.3 (it will not work with earlier versions). The xlrd and openpyxl packages are required to read .xls and .xlsx files, respectively. Optionally, if you plan to contribute changes to the code in this repository, you will also need the black package for code formatting.

All Python dependencies can be installed by running

pip install -r requirements.txt

in the repository root folder.
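
As a quick sanity check after installation, the following minimal sketch verifies the Python version and the two Excel readers named above. It is not part of the repository; the function names missing_packages and check_environment are hypothetical, and it does not check GNU Make or any other entry of requirements.txt.

```python
import importlib.util
import sys

# Minimum Python version stated in this README.
MIN_PYTHON = (3, 8)
# Packages this README names for reading .xls and .xlsx files.
EXCEL_READERS = ("xlrd", "openpyxl")


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment():
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python >= 3.8 is required")
    for name in missing_packages(EXCEL_READERS):
        problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

If the script prints nothing, neither check found a problem.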

Documentation (optional)

A .pdf of the documentation is included in the current release. To compile the documentation yourself from the source file docs/main.tex, either use your preferred LaTeX toolchain (e.g. pdflatex and bibtex), or run

make

in the docs/ folder to compile it with latexrun.

Obtaining the data

In order to build the cleaned and linked data, run

make

in the repository root folder. This creates a single cleaned and linked data set in the final/ folder, in which all records (officers, complaints, and tactical response reports) are associated with unique IDs that enable linkage among the records.
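
To get a first look at what the build produced, a minimal sketch along the following lines lists every .csv file in final/ with its column names and row count. It assumes the output files are UTF-8 encoded .csv files with a header row; the function name summarize_csv_files is hypothetical and not part of the repository.

```python
import csv
from pathlib import Path


def summarize_csv_files(folder):
    """Map each .csv file name in `folder` to (column names, number of data rows)."""
    summary = {}
    for path in sorted(Path(folder).glob("*.csv")):
        # Assumption: files are UTF-8 encoded and start with a header row.
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            row_count = sum(1 for _ in reader)
        summary[path.name] = (header, row_count)
    return summary


if __name__ == "__main__":
    output_folder = Path("final")
    if not output_folder.is_dir():
        print("final/ not found; run make in the repository root first")
    else:
        for name, (header, row_count) in summarize_csv_files(output_folder).items():
            print(f"{name}: {row_count} rows, columns: {', '.join(header)}")
```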

How the data are processed

See the documentation (main.pdf) for an in-depth discussion of the data cleaning and linking. In brief, the make command runs two primary data processing steps. First, in the cleaning step, the raw Excel files are converted to .csv files and field names are standardized across files. To perform just the cleaning step, run the following command in the repository root folder:

make prepare

This will create a tidy/ folder containing cleaned versions of the original raw data.
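
To illustrate what standardizing field names across files can look like, here is a minimal sketch. It is not the repository's actual cleaning code: the functions normalize_field_name and normalize_header, the snake_case convention, and the renames mapping are all assumptions made for illustration.

```python
import re


def normalize_field_name(name):
    """Lower-case a header and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def normalize_header(header, renames=None):
    """Normalize every field name, then apply optional explicit renames.

    `renames` maps a normalized name to the common name used across files.
    """
    renames = renames or {}
    normalized = [normalize_field_name(name) for name in header]
    return [renames.get(name, name) for name in normalized]


if __name__ == "__main__":
    # Hypothetical headers from two differently formatted source files.
    print(normalize_header(["First Name", "LAST_NAME"]))
    print(normalize_header(["first name", "Surname"], {"surname": "last_name"}))
```

Both example headers map to the same two field names, which is the property that lets later steps treat the files uniformly.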

Second, in the linking step, records of officers appearing in the different data files are linked by cleaning and matching their attributes, removing erroneous entries, etc. The linking step produces the final clean data files described under Data description. To perform just the linking step (after you have already run the cleaning step), run the following command in the repository root folder:

make finalize

This will create a final/ folder containing the final cleaned and linked version of the data.
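
Because linked records share unique IDs, related rows from different files can be joined on the ID column. The sketch below shows the idea on toy in-memory rows. The field name officer_id, the toy values, and the function names are hypothetical; consult description.md for the actual field names.

```python
from collections import defaultdict


def group_by_id(rows, id_field):
    """Group rows (dicts) by the value of their ID field."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_field]].append(row)
    return grouped


def link_records(officers, events, id_field):
    """Pair each officer row with the list of event rows sharing its ID."""
    events_by_id = group_by_id(events, id_field)
    return [(officer, events_by_id.get(officer[id_field], [])) for officer in officers]


if __name__ == "__main__":
    # Toy rows for illustration only; these are not taken from the data set.
    officers = [{"officer_id": "A"}, {"officer_id": "B"}]
    complaints = [
        {"officer_id": "A", "complaint": "c1"},
        {"officer_id": "A", "complaint": "c2"},
    ]
    for officer, linked in link_records(officers, complaints, "officer_id"):
        print(officer["officer_id"], len(linked))
```

In practice the rows would come from csv.DictReader over the files in final/ rather than from hand-written lists.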

Data description

Once you have completed the above build step, the repository will contain the cleaned and linked data. In particular, the following files will have been generated:

A detailed description of the fields present in all of these files may be found in description.md.

Examples

The examples/ folder contains Jupyter notebooks that reproduce the visualizations in the documentation. In JupyterLab or Jupyter Notebook, select Kernel -> Restart & Run All to run a notebook. Note: the notebooks are currently written so that their cells must be run in linear, top-to-bottom order (hence Kernel -> Restart & Run All).

Citation

If you use this dataset in your own project, please cite our paper published in the NeurIPS 2021 Track on Datasets and Benchmarks:

Thibaut Horel, Lorenzo Masoero, Raj Agrawal, Daria Roithmayr, and Trevor Campbell. The CPD Data Set: Personnel, Use of Force, and Complaints in the Chicago Police Department. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.

@inproceedings{Horel_NeurIPS21,
 author = {Horel, Thibaut and Masoero, Lorenzo and Agrawal, Raj and Roithmayr, Daria and Campbell, Trevor},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 editor = {J. Vanschoren and S. Yeung},
 pages = {},
 publisher = {Curran},
 title = {The CPD Data Set: Personnel, Use of Force, and Complaints in the Chicago Police Department},
 url = {https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/7f6ffaa6bb0b408017b62254211691b5-Paper-round2.pdf},
 volume = {1},
 year = {2021}
}

License

Copyright 2021 Thibaut Horel, Trevor Campbell, Lorenzo Masoero, Raj Agrawal, Daria Roithmayr

The code that cleans and links the data, as well as the code that produces the documentation for this project, is licensed under the MIT License; see MIT-LICENSE.txt for the license text. The dataset produced by the code is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License; see CC-BY-NC-SA-LICENSE.txt for the license text.

The header image in this README is by Bert Kaufmann via Wikimedia Commons (CC BY 2.0).