@daguar opened this issue 10 years ago
The first project as part of that organization, and the one whose ETL I think @boblannon is referencing most directly, is unitedstates/congress, a set of scrapers for Congressional information.
The project came about because both GovTrack (@joshdata) and Sunlight (@konklone) wanted to share the same basic scraper for Congress. GovTrack had been doing this work alone for almost 10 years with a very hardy set of Perl scripts that wrote directly to XML files, which were then distributed as bulk data and loaded into GovTrack's database. The new scrapers are written in Python, and write directly to JSON files. Some scrapers also download non-JSON source data, but any generated metadata is JSON.
The scrapers are designed to be operated via CLI, and for their output to be data files on disk, to ensure use by any technology stack.
A scraper is run like so: `./run bills --congress=113`
`run` is a thin shim that parses command-line flags into a Python options dict, executes tasks with those options, and emails uncaught exceptions to admins. The above command executes the `run` method in `tasks/bills.py` with an options dict of `{'congress': 113}`.
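As a rough sketch of what such a shim does (this is illustrative, not the project's actual flag parser), `--key=value` flags are folded into a plain dict before being handed to a task module:

```python
def parse_options(argv):
    """Parse --key=value flags into an options dict.

    A simplified sketch, not the project's actual parser, of what
    the `run` shim does before dispatching to a task module.
    """
    options = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            # Coerce integer-looking values, so --congress=113 -> 113
            options[key] = int(value) if value.isdigit() else value
    return options

# parse_options(["--congress=113"]) -> {'congress': 113}
# The shim would then call something like:
#   importlib.import_module("tasks.bills").run(options)
```

Keeping the dispatch this thin means each task module stays independently testable: you can call its `run` method with a hand-built dict and never touch the CLI.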
This manages two directories on disk: a `cache` directory for downloaded HTML files (to avoid unnecessary re-downloading) and a `data` directory for output files. These directories live at the top level of the project (or elsewhere, if configured). The "branches" of the tree are elements of a record's unique ID, and the "leaves" are files like `data.json` or `text.pdf`.
Downloading some recent bills produces a directory structure like:
data/
    113/
        hconres/
        hjres/
        hres/
        hr/
        sconres/
        sjres/
        sres/
        s/
            s1/
            s2/
            s3/
                data.json
                ...
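The convention above means a record's output location is a pure function of its ID. A minimal sketch (the function name and signature here are illustrative, not the project's actual API):

```python
import os

def output_path(congress, bill_type, bill_id, filename="data.json"):
    """Build the on-disk path for one record's output file.

    Each element of the record's unique ID becomes a directory in the
    tree, and the leaf is a file like data.json or text.pdf. This is
    a hypothetical helper illustrating the convention, not code from
    the unitedstates/congress repo.
    """
    return os.path.join("data", str(congress), bill_type, bill_id, filename)

# output_path(113, "s", "s1") -> "data/113/s/s1/data.json"
```

Because the layout is deterministic, any downstream consumer, in any language, can locate a record without querying the scraper.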
We're using scrapelib for downloading, which emerged from the Open States team's ETL work. It manages rate-limiting, robots.txt compliance (if you want it), and retries. I think if we were redoing it now, though, we might use the well-loved `requests` library.
The project also has a well-worn `download` method that provides a lot of help with caching, parsing, handling both text and binary files, Unicode, and graceful error handling. I originally wrote this method in Ruby for our Congress API, then ported it to Python, and @JoshData greatly expanded it. Just today, one of our collaborators, @wilson428, ported the method to Node and released it as a standalone `downcache` module.
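The core caching idea is simple even though the real `download` method handles far more. Here's a minimal sketch, assuming `requests` rather than scrapelib, and omitting the binary-file, Unicode, and retry handling the actual method provides:

```python
import os
import requests  # an assumption; the project itself uses scrapelib

def download(url, cache_path, force=False):
    """Fetch a URL, using a local cache file to avoid re-downloading.

    A minimal sketch of the caching idea only; the project's real
    `download` method is far more thorough about errors, encodings,
    and binary content.
    """
    if not force and os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return None  # graceful failure; the caller decides what to do
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text
```

The cache-first check is what makes re-running a scraper cheap: a full re-run against an unchanged cache touches only the disk.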
There are other hidden costs to ETL work, like detecting when something is amiss. Some scrapers report back, for each record scraped, whether it went `ok` or not, whether it was `saved` or not, and what the `reason` was. Any non-`ok` records trigger an email to an administrator. In addition, any uncaught exceptions trigger an email, handled by some rudimentary SMTP utility code.
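To make that concrete, here's a hypothetical sketch of the per-record result shape and the filtering step; the `{'ok', 'saved', 'reason'}` fields mirror the ones described above, but the helper itself is not the project's reporting code:

```python
def failed_records(results):
    """Collect the non-ok records from a scrape run.

    A hypothetical sketch: each result is a dict with 'ok', 'saved',
    and 'reason' keys, mirroring the fields described above.
    """
    return [r for r in results if not r.get("ok")]

results = [
    {"bill_id": "s1-113", "ok": True, "saved": True, "reason": None},
    {"bill_id": "s2-113", "ok": False, "saved": False, "reason": "parse error"},
]
for r in failed_records(results):
    # The real project would email these via its SMTP utility code;
    # here we just print the alert.
    print("ALERT: %s failed: %s" % (r["bill_id"], r["reason"]))
```

Structuring per-record results this way means the alerting logic never needs to know anything about any particular scraper.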
The downside of using a "standalone" scraper like this, which isn't integrated into any system on its own, is that you need to manage the timing and piping of data from the scraper into your system. For Sunlight's API, that's just a `&&` between the scrape command and the load command in our crontab, but that's a pretty crude approach. Recently, @dvogel submitted a patch that adds a monkey-patching hook after each record, for a more sophisticated queue-based connector he'd like to implement for OpenCongress.
Making the system language agnostic was necessary for the two original integrators -- GovTrack is a Python app, and Sunlight's Congress API is a Ruby app. Both systems simply depend on the output data directory of our scraper to be predictably arranged. So even if scrapers are expected to be written in Python, the integration point is language-agnostic.
Another project that does this is unitedstates/citation, a legal citation extractor written in JavaScript/Node. Sunlight uses this to batch-process legal documents from a Ruby script over an HTTP interface. The DC Code browser's first version used it straight from in-browser JS; the second version does it via a Node batch processor. It can also be operated from the command line, writing to STDOUT. Designing the project this way makes integration a much easier pitch to non-Node projects.
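The command-line mode is the most language-agnostic of those integration points: any caller can pipe text in and read STDOUT back. A sketch from the Python side, using `cat` as a stand-in binary (this doesn't try to reproduce the citation extractor's actual invocation):

```python
import subprocess

def run_cli_tool(command, text):
    """Pipe text to a standalone CLI tool and capture its STDOUT.

    This is the language-agnostic integration point: the caller never
    needs to know what language the tool is written in. `command` is
    a stand-in; the real citation extractor has its own invocation,
    which this sketch doesn't attempt to reproduce.
    """
    result = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout

# With `cat` standing in for a real extractor binary, the text
# comes back unchanged:
output = run_cli_tool(["cat"], "See 5 U.S.C. 552.")
```

A Ruby or Go caller would do exactly the same thing with its own subprocess API, which is the whole point.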
I think building modules this way encourages a more Unix-y way of making tools that each do one thing well, encourages collaboration between different product owners, and just as importantly supports polyglot teams and architectures. I've noticed a lot of developers (and managers!) resist this idea, under what I think is the mistaken impression that committing to one language keeps projects and teams simpler and more efficient.
Python, Ruby, and Node each have ecosystems that have optimized for different sectors in the great world of computation. We shouldn't limit ourselves or our teams to just rallying around whatever looks like the best Giant Hammer language of the time. More languages will continue to emerge to serve different communities, like Julia, Go, and perhaps Rust. ETL projects that wish to be used by as many people or organizations as possible should plan to be useful to any modern OSS technology.
Idea taken from @boblannon's post here: http://sunlightfoundation.com/blog/2014/03/21/data-plumbers/
Since Bob calls out the UnitedStates repos as representing some good practices (straightforward, easily accessible code), I think a walkthrough of a specific (small) script would be invaluable.
The walkthrough could call out explicit code decisions (e.g., "we use X to download, and dump to this folder because...") that would make concrete some of the higher-level values stated in the post.
cc @konklone @JoshData @migurski