Gut checks and breakpoints for ETL pipelines

hancush commented 4 years ago

Background

One of the primary tenets of our approach to ETL is that it should be deterministic – that is, it should always produce the same result. Yet we also rely on external data sources, such as APIs or even client inputs. Variable inputs present a challenge to producing outputs deterministically, and they can lead to baffling, difficult-to-debug errors.

Proposal

I'd like to do some reading about how other folks have handled this and compile best practices, along with examples of good places to introduce defensive programming, validation of expected values, and even breakpoints for manual review. I'll add these to our etl/ directory.
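
To make the idea concrete, here is a rough sketch of what a gut check with a breakpoint might look like in Python. Everything specific in it is an assumption for illustration: the column names, the 5% threshold, and the `validate` / `ValidationError` names are hypothetical and not taken from any existing pipeline.

```python
import csv
import io

# Hypothetical expectations about the input, for illustration only.
EXPECTED_COLUMNS = {"id", "name", "amount"}
MAX_BAD_ROW_SHARE = 0.05  # assumed tolerance before a human should look


class ValidationError(Exception):
    """Raised when the input strays too far from what the pipeline expects."""


def check_header(fieldnames):
    """Fail fast if the source dropped or renamed a column we depend on."""
    missing = EXPECTED_COLUMNS - set(fieldnames or [])
    if missing:
        raise ValidationError(f"missing columns: {sorted(missing)}")


def is_valid(row):
    """A row passes if it has a non-empty id and a numeric amount."""
    try:
        float(row["amount"])
    except (TypeError, ValueError):
        return False
    return bool(row["id"])


def validate(source):
    """Return (good, bad) rows, or raise if too many rows look wrong."""
    reader = csv.DictReader(source)
    check_header(reader.fieldnames)

    good, bad = [], []
    for row in reader:
        (good if is_valid(row) else bad).append(row)

    total = len(good) + len(bad)
    if total and len(bad) / total > MAX_BAD_ROW_SHARE:
        # The "breakpoint": stop here rather than pass bad data downstream.
        raise ValidationError(
            f"{len(bad)} of {total} rows failed validation; "
            "review the input before continuing"
        )
    return good, bad


# Example: a well-formed input passes straight through.
good, bad = validate(io.StringIO("id,name,amount\n1,a,2.5\n2,b,3\n"))
```

If a check like this runs as its own pipeline step, an uncaught `ValidationError` makes the script exit with a non-zero status, which is one way to halt the run until someone has reviewed the input. Writing the `bad` rows to a file for inspection would be a natural extension.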

Deliverables

See above.

Timeline

I expect this to take about a day of focused work.

Cc @fgregg – any resources or examples that come to mind that may be relevant here?

hancush commented 4 days ago

An ancient validation script as part of an old PDF data extraction effort: https://github.com/City-Bureau/get-the-lead-out/blob/master/scripts/validate.py

hancush commented 4 days ago

Also love this resource: https://github.com/Quartz/bad-data-guide
