Closed josh-chamberlain closed 10 months ago
@josh-chamberlain So I've been experimenting with this, and I got the number of relevant data sets down to 1157 from 7000, since a lot of the data in the Public Safety category is irrelevant. I've run into a few issues, though:
- Agency names are inconsistent: some are spelled out like `County of San Mateo Sheriff's Office`, others are abbreviated like `FLPD`, and some are broad like `City of Cincinnati`
- There's no obvious way to determine `record_type` without manually looking through each one

With this in mind, do you have any ideas for how to proceed? Should I just go ahead despite these limitations? Is `agency_described` a required field, and would that mean dropping about 1/5th of the available records?
@EvilDrPurple 1157
`approved` checkbox, meaning we can create a modest little pile of data sources which need further classification. It's messed up that they don't have this standardized. They could probably use our Agencies db!
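One possible reading of the "pile of data sources which need further classification" idea, as a rough sketch rather than the project's actual workflow: instead of dropping records that lack `agency_described` or `record_type`, set them aside for later review. The function name `split_by_completeness` and the choice of which fields count as required are assumptions made for illustration.

```python
def split_by_completeness(records, required=("agency_described", "record_type")):
    """Partition records into (complete, needs_classification).

    `records` is a list of dicts keyed by field name. Which fields are
    treated as required is an assumption, not a confirmed project rule.
    """
    complete, needs_classification = [], []
    for record in records:
        # A field counts as missing if it is absent, None, or blank.
        missing = [f for f in required if not (record.get(f) or "").strip()]
        if missing:
            needs_classification.append(record)
        else:
            complete.append(record)
    return complete, needs_classification


# Made-up sample rows, only to show the shape of the input.
sample = [
    {"submitted_name": "Dataset A", "agency_described": "FLPD", "record_type": "example type"},
    {"submitted_name": "Dataset B", "agency_described": "", "record_type": None},
]
ready, pending = split_by_completeness(sample)
```

Nothing is discarded this way: `pending` holds the rows that would still need a person to fill in the agency or record type.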
The task
This is a list of potential data sources. (here it is in our data sources db)
Write a scraper which can collect information about these Data Sources and put them in a CSV, ready for upload to our Data Sources database.
We'll need a unique ID of some kind to check for duplicates when we run this again; maybe `source_url`?

Resources
`submitted_name`, `record_type`, `agency_described`, `source_url`
`data_portal_type` and `readme_url`
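
The CSV step and the duplicate check described in the task could be sketched roughly as follows. This is a minimal sketch under a few assumptions: the CSV columns are exactly the six field names listed in this issue, `source_url` is the dedupe key as suggested above, and the helper names (`normalize_url`, `load_seen_urls`, `append_new_sources`) are hypothetical. The scraping itself is left out, since the portal's API is not described here.

```python
import csv
import os

# Column names taken from this issue; the real upload template may differ.
FIELDS = [
    "submitted_name",
    "record_type",
    "agency_described",
    "source_url",
    "data_portal_type",
    "readme_url",
]


def normalize_url(url):
    """Light normalization so trivially different URLs compare equal.

    Only trims whitespace and trailing slashes (an assumption; stricter
    or looser rules may suit the real data better).
    """
    return url.strip().rstrip("/")


def load_seen_urls(csv_path):
    """Return the set of source_url values already present in the CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {
            normalize_url(row["source_url"])
            for row in csv.DictReader(f)
            if row.get("source_url")
        }


def append_new_sources(csv_path, records):
    """Append records whose source_url is not already in the CSV.

    Returns the number of rows written. Records without a source_url
    are skipped because they cannot be checked for duplicates.
    """
    seen = load_seen_urls(csv_path)
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    added = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore", restval="")
        if write_header:
            writer.writeheader()
        for record in records:
            key = normalize_url(record.get("source_url") or "")
            if not key or key in seen:
                continue
            seen.add(key)
            writer.writerow(record)
            added += 1
    return added
```

Running `append_new_sources("sources.csv", scraped_records)` a second time with the same records should add nothing, which is the behavior the "run this again" requirement asks for. Keying on `source_url` also sidesteps the inconsistent agency names, since the URL does not depend on how the agency was entered.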