Police-Data-Accessibility-Project / meta

Planning our activities with issues that don't fit in a specific repository yet.
GNU General Public License v3.0

Define common 'interface' for all scrapers #92

Closed: jameskranz closed this issue 3 years ago

jameskranz commented 4 years ago

Possible things to define:

mcsaucy commented 4 years ago

$OUTPUT_DIR seems like a good approach since it's universally easy to implement and keeps development easy (no need to set up dev instance of whatever we're storing). There is ongoing work in #76 to dockerize the existing scrapers, so we could have that common sidecar process watch some directory which is bind mounted into that container.
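For concreteness, here's a minimal sketch of what the scraper side of that contract could look like in Python. The file name, field names, and the `./output` fallback are illustrative placeholders, not anything we've decided:

```python
import json
import os
import pathlib

# Hypothetical scraper output step: write results into the directory named by
# $OUTPUT_DIR (falling back to ./output for local development). The sidecar
# watching that directory handles upload, so the scraper never needs to know
# where the data ultimately goes.
output_dir = pathlib.Path(os.environ.get("OUTPUT_DIR", "./output"))
output_dir.mkdir(parents=True, exist_ok=True)

records = [{"case_id": "2020-001234", "county": "Bay"}]  # placeholder data
with open(output_dir / "results.json", "w") as f:
    json.dump(records, f)
```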

Francoded commented 4 years ago

Hopped over from Issue #95. Do we have a design doc? I feel that may be a good start. I drafted some of my initial thoughts in a Google doc last night and I'd love to move that over to a central place if we have one already.

jameskranz commented 4 years ago

@Francoded I'm not aware of a central location for documents like that, but definitely think opening one for comment and collaboration is a great idea.

Francoded commented 4 years ago

Cool! I'll just drop the google doc I created last night here. I didn't spend too much time on it as I wasn't sure how much progress we've made in this regard so the doc is definitely in its early stages. @jameskranz, I can add you as an editor so you can help flesh out details if you'd like. I haven't said this yet but I do think an $OUTPUT_DIR environment variable is a good idea.

The document does target Python (as that is the language I'll probably be using for this project) but the design should be easy to apply to other languages. Since this is targeting Python, I'd love to get some input from @OscarVanL who has been working on the FL Bay County Python scraper.
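To make the Python angle concrete, here's a rough sketch of the kind of common interface the design doc could end up defining. The names (`ScraperBase`, `run`, `run_scraper`) are hypothetical placeholders, not something from the doc:

```python
import abc
import os
import pathlib

class ScraperBase(abc.ABC):
    """Hypothetical base class every scraper would implement."""

    #: human-readable identifier, e.g. "fl-bay-county"
    name: str

    @abc.abstractmethod
    def run(self, output_dir: pathlib.Path) -> None:
        """Scrape the source and write result files into output_dir."""

def run_scraper(scraper: ScraperBase) -> None:
    # Resolve the shared contract: results go under $OUTPUT_DIR, one
    # subdirectory per scraper (the layout is an assumption for this sketch).
    out = pathlib.Path(os.environ.get("OUTPUT_DIR", "./output")) / scraper.name
    out.mkdir(parents=True, exist_ok=True)
    scraper.run(out)
```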

skoold2003 commented 4 years ago

Eventually the data would be compiled into a DB but that has yet to be built. For now I've been using Contentful as a data source and API. I think it would be good to use that until a DB/API of our own has been built. I can add people to the space on there if we want to go that route.

jameskranz commented 4 years ago

> Eventually the data would be compiled into a DB but that has yet to be built. For now I've been using Contentful as a data source and API. I think it would be good to use that until a DB/API of our own has been built. I can add people to the space on there if we want to go that route.

Yes, it will eventually be in a DB. However, I think it simplifies scrapers to not have them handle upload. This contract would allow for a single common sidecar application, deployed into the same container or a sibling container, to handle the upload side of things without the scraper having to know anything about how the upload occurs or care if the upload method changes. Imagine if we decided to change something and every scraper, in all the different languages people wrote them in, had to have its upload code updated.
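As a rough illustration of that split, the sidecar could be as simple as a loop that polls the shared $OUTPUT_DIR and hands any new file to a single upload routine. The polling approach, file pattern, and upload stub below are assumptions for the sketch, not a settled design:

```python
import os
import pathlib
import time

# Shared directory bind-mounted into the scraper container (see #76).
WATCH_DIR = pathlib.Path(os.environ.get("OUTPUT_DIR", "/data/output"))

def upload(path: pathlib.Path) -> None:
    # Placeholder: push the file to whatever backend we settle on
    # (Contentful now, our own DB/API later). Only this function changes
    # if the upload target changes; scrapers stay untouched.
    print(f"uploading {path}")

def watch(poll_seconds: int = 30) -> None:
    seen = set()
    while True:
        for path in WATCH_DIR.rglob("*.json"):
            if path not in seen:
                upload(path)
                seen.add(path)
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch()
```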