codeforamerica / etl-for-america

A convening place for ETL, data integration, and "plumbing the pipes" of government data

Walkthrough of a simple ETL job #1


daguar commented 10 years ago

Idea taken from @boblannon's post here: http://sunlightfoundation.com/blog/2014/03/21/data-plumbers/

Since Bob calls out the UnitedStates repos as representing some good practices (straightforward, easily accessible code), I think a walkthrough of a specific (small) script would be invaluable.

The walkthrough could call out explicit code decisions (e.g., "we use X to download, and dump to this folder because...") that would make concrete some of the higher-level values stated in the post.

cc @konklone @JoshData @migurski

konklone commented 10 years ago

The first project as part of that organization, and the one whose ETL I think @boblannon is referencing most directly, is unitedstates/congress, a set of scrapers for Congressional information.

The project came about because both GovTrack (@joshdata) and Sunlight (@konklone) wanted to share the same basic scraper for Congress. GovTrack had been doing this work alone for almost 10 years with a very hardy set of Perl scripts that wrote directly to XML files, which were then distributed as bulk data and loaded into GovTrack's database. The new scrapers are written in Python, and write directly to JSON files. Some scrapers also download non-JSON source data, but any generated metadata is JSON.

Running

The scrapers are designed to be operated from the CLI and to write their output as data files on disk, so that any technology stack can use them.

A scraper is run like so:

./run bills --congress=113

run is a thin shim that parses command line flags into a Python options dict, executes tasks with those options, and emails uncaught exceptions to admins. The above command executes tasks/bills.py's run method with an options dict of {'congress': 113}.
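
For the curious, here's roughly what that shim is doing (a simplified sketch, not the actual run script):

```python
# Illustrative sketch of what a shim like `run` does -- not the project's
# actual code. It turns `./run bills --congress=113` into a call to the
# run() method in tasks/bills.py with an options dict like {'congress': '113'}.
import importlib
import sys
import traceback


def parse_options(args):
    """Turn ['--congress=113', '--fast'] into {'congress': '113', 'fast': True}."""
    options = {}
    for arg in args:
        if arg.startswith("--"):
            key, _, value = arg[2:].partition("=")
            options[key] = value if value else True
    return options


def main():
    task_name, flags = sys.argv[1], sys.argv[2:]   # e.g. 'bills', ['--congress=113']
    options = parse_options(flags)

    task = importlib.import_module("tasks.%s" % task_name)
    try:
        task.run(options)
    except Exception:
        # The real shim emails uncaught exceptions to admins; this sketch
        # just prints the traceback and exits non-zero.
        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()
```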

The project manages two directories on disk: a cache directory for downloaded HTML files (to avoid unnecessary re-downloading) and a data directory for output files. These directories are at the top level of the project (or elsewhere, if configured). The "branches" of the data tree are the elements of a record's unique ID, and the "leaves" are files like data.json or text.pdf.

Downloading some recent bills produces a directory structure like:

data/
  113/
    hconres/
    hjres/
    hres/
    hr/
    sconres/
    sjres/
    sres/
    s/
      s1/
      s2/
      s3/
        data.json
      ...
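
Mapping a record's ID onto that tree is just path construction. As a minimal sketch (function names here are illustrative, not the project's actual helpers):

```python
# Minimal sketch of how a record's unique ID maps onto the tree above;
# function names are illustrative, not the project's actual helpers.
import json
import os


def output_path(congress, bill_type, number, filename="data.json"):
    """e.g. (113, 's', 3) -> data/113/s/s3/data.json"""
    bill_dir = "%s%d" % (bill_type, number)
    return os.path.join("data", str(congress), bill_type, bill_dir, filename)


def write_record(record, congress, bill_type, number):
    path = output_path(congress, bill_type, number)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
```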

Support stack

We're using scrapelib for downloading, which emerged from the Open States team's ETL work. It manages rate-limiting, obeying robots.txt (if you want it to), and retry attempts. I think if we were redoing it now, though, we might use the well-loved requests library.
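
For reference, configuring a scrapelib fetcher looks roughly like this (the values are just examples, not our settings; recent scrapelib versions behave like a requests-style session with throttling and retries layered on top):

```python
# Example scrapelib configuration (values are illustrative, not the
# project's settings). Scraper layers rate limiting and retries on top
# of a requests-style session.
import scrapelib

scraper = scrapelib.Scraper(
    requests_per_minute=60,   # be polite to the source site
    retry_attempts=3,         # retry transient failures
    retry_wait_seconds=10,    # back off between retries
)

response = scraper.get("https://example.com/some/page")
html = response.text
```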

The project also has a well-worn download method that helps with caching, parsing, handling both text and binary files, Unicode, and graceful error handling. This method is something I originally wrote in Ruby for our Congress API, which I ported to Python and @JoshData greatly expanded. Then today, one of our collaborators, @wilson428, ported that method to Node and released a standalone downcache module.
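
Stripped of all that help, the core caching idea is small; a heavily simplified version (using requests rather than our actual utils) looks something like:

```python
# Heavily simplified, illustrative version of a cache-aware download helper;
# the real utils in unitedstates/congress do considerably more (binary files,
# Unicode cleanup, graceful error handling, post-processing).
import os
import requests


def download(url, cache_path, force=False):
    """Return the body at `url`, reusing cache/<cache_path> when present."""
    path = os.path.join("cache", cache_path)

    if not force and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    body = response.text

    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    return body
```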

There are other hidden costs to ETL work, like detecting when something is amiss. Some scrapers report back, for each record scraped, whether it went ok, whether it was saved, and the reason if not. Any non-ok records trigger an email to an administrator. In addition, any uncaught exceptions trigger an email, handled by some rudimentary SMTP utility code.
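
The reporting itself is nothing fancy; in spirit it's something like this sketch (addresses, SMTP settings, and record fields are placeholders):

```python
# Sketch of per-record reporting plus an admin email for anything that
# didn't go ok. Addresses, SMTP settings, and record IDs are placeholders.
import smtplib
from email.message import EmailMessage

results = [
    {"id": "s3-113", "ok": True, "saved": True, "reason": None},
    {"id": "s4-113", "ok": False, "saved": False, "reason": "download failed"},
]

problems = [r for r in results if not r["ok"]]

if problems:
    msg = EmailMessage()
    msg["Subject"] = "Scraper problems: %d record(s)" % len(problems)
    msg["From"] = "scraper@example.com"
    msg["To"] = "admin@example.com"
    msg.set_content("\n".join("%(id)s: %(reason)s" % r for r in problems))

    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```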

The downside of using a "standalone" scraper like this, which isn't integrated into any one system on its own, is that you need to manage the timing and piping of data from the scraper into your system. For Sunlight's API, that's just a && in our crontab between the command to scrape and the command to load, but that's a pretty crude approach. Recently, @dvogel submitted a patch that adds a monkey-patching hook after each record, for a more sophisticated queue-based connector he'd like to implement for OpenCongress.
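
The general shape of a per-record hook is simple; as a rough, hypothetical illustration (not @dvogel's actual patch):

```python
# Rough, hypothetical illustration of a per-record hook -- not the actual
# patch. By default the hook does nothing; an integrator can replace it to
# push each saved record onto a queue instead of waiting for a cron job.
import json


def post_record_hook(record):
    """Called after each record is written to disk; a no-op by default."""
    pass


def save_record(record, path):
    with open(path, "w") as f:
        json.dump(record, f)
    post_record_hook(record)   # the integration point


# An integrator's code could monkey-patch it at import time, e.g.:
#   import scraper_tasks
#   scraper_tasks.post_record_hook = my_queue.enqueue
```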

Language agnostic

Making the system language agnostic was necessary for the two original integrators -- GovTrack is a Python app, and Sunlight's Congress API is a Ruby app. Both systems simply depend on the scraper's output data directory being predictably arranged. So even if the scrapers themselves are expected to be written in Python, the integration point is language-agnostic.
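
Concretely, a loader in any language just walks the data directory and parses JSON; in Python (purely as an illustration, following the tree above) that's about this much code:

```python
# Illustration only: a consumer integrates by walking the data/ directory
# and parsing data.json files -- no particular language required.
import glob
import json

for path in glob.glob("data/113/*/*/data.json"):
    with open(path) as f:
        record = json.load(f)
    # hand the parsed record off to whatever system is doing the loading
    print(path, len(record))
```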

Another project that does this is unitedstates/citation, a legal citation extractor written in JavaScript/Node. Sunlight uses it to batch-process legal documents from a Ruby script over an HTTP interface. The DC Code browser's first version used it straight from in-browser JS; the second version uses a Node batch processor. It can also be run from the CLI, writing to STDOUT. Designing the project this way makes integration a much easier pitch to non-Node projects.

I think building modules this way encourages a more Unix-y way of making tools that each do one thing well, encourages collaboration between different product owners, and just as importantly supports polyglot teams and architectures. I've noticed a lot of developers (and managers!) resist this idea, under what I think is the mistaken impression that committing to one language keeps projects and teams simpler and more efficient.

Python, Ruby, and Node each have ecosystems that have optimized for different sectors in the great world of computation. We shouldn't limit ourselves or our teams to just rallying around whatever looks like the best Giant Hammer language of the time. More languages will continue to emerge to serve different communities, like Julia, Go, and perhaps Rust. ETL projects that wish to be used by as many people or organizations as possible should plan to be useful to any modern OSS technology.