Open josh-chamberlain opened 1 week ago
Clarification needed:
There is the muckrock_tools and then there is the muckrock_scraper.py with templates. Should only the muckrock_tools be moved (since the other doesn't appear to deal with source collection)?
@eddie-m-m good catch, yes! The other one is for grabbing files from MuckRock, and it's in the right spot!
Context
https://github.com/Police-Data-Accessibility-Project/data-source-identification
The Scrapers repo is for collecting data from one or more sources at a time for use/analysis.
However, we have some tools for scraping with the express goal of generating sources (lists of URLs) for submission to our database. We have tools in the data source ID repo which can parse those lists of URLs, for example by identifying agencies or sending the URLs to our annotation pipeline.
Requirements
common_crawler
called something like "source_collectors"