Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Open Data Network data source scraper #204

Closed: josh-chamberlain closed this issue 10 months ago

josh-chamberlain commented 1 year ago

The task

This is a list of potential data sources. (here it is in our data sources db)

Write a scraper which can collect information about these Data Sources and put them in a CSV, ready for upload to our Data Sources database.

We'll need a unique ID of some kind to check for duplicates when we run this again; maybe source_url?
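
For reference, here's a minimal sketch of what the scraping half could look like, assuming we go through the Socrata Discovery API that powers the Open Data Network. The endpoint, query parameters, response field names, and CSV columns below are assumptions to verify against the live API and our upload template:

```python
# Hedged sketch: pull Public Safety catalog entries from the Socrata
# Discovery API (which backs the Open Data Network) and write a CSV
# keyed on source_url for later dedup. Field names are assumptions;
# check an actual API response before relying on them.
import csv

import requests

API = "https://api.us.socrata.com/api/catalog/v1"

def fetch_results(limit=100):
    """Page through the catalog until no results remain."""
    offset = 0
    while True:
        resp = requests.get(API, params={
            "categories": "Public Safety",
            "limit": limit,
            "offset": offset,
        })
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break
        yield from results
        offset += limit

with open("odn_sources.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "agency_described", "source_url"])
    writer.writeheader()
    for r in fetch_results():
        resource = r.get("resource", {})
        writer.writerow({
            "name": resource.get("name", ""),
            "agency_described": resource.get("attribution", ""),
            # the permalink serves as the unique ID for dedup on re-runs
            "source_url": r.get("permalink", ""),
        })
```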

Resources

EvilDrPurple commented 10 months ago

@josh-chamberlain So I've been experimenting with this, and I've gotten the number of relevant datasets down to 1157 from roughly 7000, since a lot of the data in the Public Safety category is irrelevant. I've run into a few issues, though:

  1. Some submitted names are extremely unhelpful, such as `me`, `joy`, and my favorite, `Test for API Download`
  2. The agency name is also all over the place: about 250 of the results have no agency filled out at all; some are good, like `County of San Mateo Sheriff's Office`; others are abbreviated, like `FLPD`; and some are broad, like `City of Cincinnati`
  3. There's nothing to indicate `record_type` without manually looking through each one
  4. It would likely be best to wait for our own API to be finished so that the program can compare the ODN link with what's in our database and eliminate duplicates before the CSV is generated (a stopgap is sketched after this list)

With this in mind, do you have any ideas for how to proceed? Should I just go ahead despite these limitations? Is `agency_described` a required field, and would that mean dropping about a fifth of the available records?
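
Until the API exists, a stopgap for item 4 might be deduping against a one-off export of the existing Data Sources table. A rough sketch, where `existing_sources.csv`, `odn_sources.csv`, and the `source_url` column are assumed names:

```python
# Hedged sketch: filter scraped rows against an export of the existing
# Data Sources table so only new source_urls reach the upload CSV.
import csv

def load_existing_urls(path="existing_sources.csv"):
    """Collect known source_urls, lightly normalized for comparison."""
    with open(path, newline="") as f:
        return {row["source_url"].strip().rstrip("/") for row in csv.DictReader(f)}

def dedupe(scraped_path="odn_sources.csv", out_path="odn_sources_new.csv"):
    existing = load_existing_urls()
    with open(scraped_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # skip anything whose URL already exists in our database export
            if row["source_url"].strip().rstrip("/") not in existing:
                writer.writerow(row)

dedupe()
```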

josh-chamberlain commented 10 months ago

@EvilDrPurple Regarding the 1157:

  1. I don't see a way to check this programmatically yet, so I'm fine to take some time to review these manually.
  2. I think generating the CSV with this blank is OK, and it's easy to ditch them later. Our database has an approved checkbox, meaning we can create a modest little pile of data sources which need further classification. It's messed up that they don't have this standardized. They could probably use our Agencies db!
  3. Machine learning magic? We have a decent dataset for identifying record type; there's a rough sketch of the idea at the end of this comment. We could collect as many of the other properties as is reasonably easy and go back for those. Let's discuss this separately.
  4. I'm down to do it "manually"; deduping by URL is not too hard in CSV editors or Google Sheets.
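
For what the "machine learning magic" in point 3 might look like: a rough sketch that trains a bag-of-words classifier on already-labeled sources to guess `record_type` from a dataset's name. `labeled_sources.csv` and its columns are assumptions, and predictions would only be hints for manual review, not final labels:

```python
# Hedged sketch: learn record_type from dataset names using a simple
# TF-IDF + logistic regression pipeline (scikit-learn).
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# assumed training file: one row per labeled source, with
# "name" and "record_type" columns
with open("labeled_sources.csv", newline="") as f:
    rows = list(csv.DictReader(f))

names = [r["name"] for r in rows]
labels = [r["record_type"] for r in rows]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)

# predict a record_type hint for an unlabeled ODN dataset name
print(model.predict(["Officer Involved Shootings 2015-2020"]))
```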