codeforboston / police-data-trust

A national archive of police data collected by journalists, lawyers, and activists around the country.
https://www.nationalpolicedata.org
MIT License

[FEATURE] Improve 50-a Data Collection #388

Open DMalone87 opened 5 months ago

DMalone87 commented 5 months ago

Is your feature request related to a problem? Please describe.

Currently, our 50-a Scraper does not properly capture officer data. First, officers are not being associated with their Units. We are collecting the unit names, but we aren't taking the step of connecting each officer to the unit(s) that they've worked for. Second, we aren't properly collecting the complaints associated with each officer. We are collecting the dispositions of the complaints, but we aren't associating complaint data with individual officers.

Describe the solution you'd like

When scraping officer data from 50-a.org, make the following adjustments:

This means an entry in the JSON output might change from this:

{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/TK8M", "name": "Benjamin F. Colecchia", "badge": "Badge #3490", "race": "White", "gender": "Male", "complaints": [{"name": "complaints", "count": 1}, {"name": "allegations", "count": 1}, {"name": "substantiated", "count": 0}, {"name": "Exonerated", "count": 1}], "age": null}
{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/7G3P", "name": "Ernesto Nieves", "badge": "Badge #4684", "race": "Hispanic", "gender": "Male", "complaints": [{"name": "complaints", "count": 2}, {"name": "allegations", "count": 2}, {"name": "substantiated", "count": 0}, {"name": "Complaint Withdrawn", "count": 1}, {"name": "Exonerated", "count": 1}], "age": "23"}

To this:

{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/TK8M", "name": "Benjamin F. Colecchia", "badge": "Badge #3490", "race": "White", "gender": "Male", "complaints": [9800290], "age": null, "taxnum": "918638"}
{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/7G3P", "name": "Ernesto Nieves", "badge": "Badge #4684", "race": "Hispanic", "gender": "Male", "complaints": [200410455, 200207742], "age": "23", "taxnum": "922871"}
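The change in shape can be checked mechanically. The following is a minimal sketch, assuming the proposed records keep exactly the keys shown in the example above; `OfficerRecord` and `parse_officer` are hypothetical names for illustration and are not part of the existing scraper.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OfficerRecord:
    """Hypothetical container mirroring the keys of the proposed officer JSON."""
    scraped_at: str
    url: str
    name: str
    badge: str
    race: str
    gender: str
    complaints: List[int]  # complaint IDs instead of summary counts
    age: Optional[str]
    taxnum: str


def parse_officer(line: str) -> OfficerRecord:
    """Parse one JSONL line, rejecting the old summary-count shape."""
    data = json.loads(line)
    if not all(isinstance(c, int) for c in data["complaints"]):
        raise ValueError("complaints must be a list of complaint IDs")
    # A missing key such as "taxnum" raises TypeError here.
    return OfficerRecord(**data)
```

A record in the old format fails with `ValueError` because its `complaints` entries are objects rather than IDs, which makes it easy to spot output from the unmodified scraper.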

When scraping command data, make the following adjustments:

Therefore this:

{"scraped_at": "2024-05-15 14:17:28", "name": "24th Precinct", "url": "https://www.50-a.org/command/24pct"}

Will become this:

{"scraped_at": "2024-05-15 14:17:28", "name": "24th Precinct", "url": "https://www.50-a.org/command/24pct", "website_url": "https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts/24th-precinct.page", "commanding_officer": "https://www.50-a.org/officer/KYGH", "address": "151 W 100th St, New York, NY 10025", "description": "The 24th Precinct is located on the Upper West Side of Manhattan and encompasses Manhattan Valley and a portion of Riverside Park. It is a residential and commercial community of multiple dwelling homes and one major housing development.", "officers": [{"url": "https://www.50-a.org/officer/WHJ5", "most_recent": 2024}, {"url": "https://www.50-a.org/officer/4JJ9", "most_recent": 2024}, {"url": "https://www.50-a.org/officer/J7Y3", "most_recent": 2023}]}
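With an `officers` list on each command record, the officer-to-unit association described in the problem statement could be derived by inverting the command data. This is a rough sketch, assuming each command record carries `url` and `officers` keys as in the example; `units_by_officer` is a hypothetical helper, not an existing function in the scraper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def units_by_officer(commands: Iterable[dict]) -> Dict[str, List[dict]]:
    """Invert command records into officer URL -> list of unit assignments."""
    index: Dict[str, List[dict]] = defaultdict(list)
    for command in commands:
        # Commands scraped in the old format have no "officers" key.
        for officer in command.get("officers", []):
            index[officer["url"]].append(
                {
                    "command_url": command["url"],
                    "most_recent": officer["most_recent"],
                }
            )
    return dict(index)
```

An officer who appears under several commands ends up with one entry per command, so the unit history can be attached to the officer record at ingestion time.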

Additional context

aasnani commented 3 months ago

image

aasnani commented 3 months ago

So with the inclusion of the officer data on their page, it looks like the ingestion of officer data should be split into two parts: first, downloading the CSV and storing that data; second, scraping the other data that isn't in the CSV, like the list of complaint numbers, gender, age, url, etc. We could then enrich the previously stored data in the ingestion layer, or I guess incorporate pandas in the scraper repo and do some data processing there to enrich the CSV and output a single JSONL file. Not sure what the better approach is.
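To make the second option concrete, here is a rough sketch of enriching CSV rows with scraped fields and emitting one JSONL line per officer. It uses only the standard library to stay dependency-free; the same join could be written as a pandas merge. It assumes the CSV and the scraped records share a tax-number column, which I haven't verified against the actual file; the `taxnum` key default, the `last_name` column in the usage example, and the `enrich` function are all hypothetical.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator


def enrich(csv_text: str, scraped: Iterable[dict], key: str = "taxnum") -> Iterator[str]:
    """Yield one JSONL line per CSV row, merged with scraped fields on a key match."""
    # Index scraped records by the join key, skipping records that lack it.
    by_key: Dict[str, dict] = {rec[key]: rec for rec in scraped if rec.get(key)}
    for row in csv.DictReader(io.StringIO(csv_text)):
        merged = dict(row)
        # Scraped values win on overlapping keys; unmatched rows pass through as-is.
        merged.update(by_key.get(row[key], {}))
        yield json.dumps(merged)


if __name__ == "__main__":
    csv_text = "taxnum,last_name\n922871,Nieves\n"
    scraped = [{"taxnum": "922871", "complaints": [200410455, 200207742]}]
    for line in enrich(csv_text, scraped):
        print(line)
```

Doing the join in the scraper keeps the ingestion layer simple, at the cost of re-running the merge whenever either source changes.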

Page with the officer CSV: https://www.50-a.org/about

@DMalone87 you mentioned the complaints page but I'm not sure how to find it, could you link it here?

DMalone87 commented 3 months ago

Sure thing! Recently Added Complaints Recently Updated Complaints

aasnani commented 3 months ago

PR Here: https://github.com/National-Police-Data-Coalition/police-data-trust-scrapers/pull/17

EDIT: Closed, need to add tests. Will create another PR.

aasnani commented 3 months ago

PR Here: PR for Issue #388 (Improve 50A Collection) on Main Repo #19