Open hancush opened 4 years ago
An ancient validation script from an old PDF data extraction effort: https://github.com/City-Bureau/get-the-lead-out/blob/master/scripts/validate.py
Also love this resource: https://github.com/Quartz/bad-data-guide
Background
One of the primary tenets of our approach to ETL is that it should be deterministic – that is, the same inputs should always produce the same result. Yet we also rely on external data sources, such as APIs or even client inputs. Variable inputs present a challenge to deterministically producing outputs and can lead to baffling, difficult-to-debug errors.
Proposal
I'd like to do some reading about how other folks have handled this and compile some best practices and examples of good places to introduce defensive programming, validation of expected values, and even break points for manual review. I'll add these to our etl/ directory.
Deliverables
See above.
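To make the idea concrete, here is a rough sketch of the kind of example the write-up might include. It is only an illustration, not code from any existing pipeline: the column names, the allowed status values, and the names `validate_rows`, `ValidationError`, and `ValidationResult` are all hypothetical. It fails fast when the input shape is wrong and sets aside unexpected values for manual review instead of guessing.

```python
from dataclasses import dataclass, field

# Hypothetical expectations about an external input; adjust per data source.
EXPECTED_COLUMNS = {"id", "status", "amount"}
KNOWN_STATUSES = {"active", "inactive"}


class ValidationError(Exception):
    """Raised when the input is structurally unusable and the run should stop."""


@dataclass
class ValidationResult:
    valid: list = field(default_factory=list)
    # (row number, row, reason) tuples that a person should look at.
    needs_review: list = field(default_factory=list)


def validate_rows(rows):
    """Check each row against expectations before any transformation runs."""
    result = ValidationResult()
    for row_number, row in enumerate(rows, start=1):
        # Hard failure: a missing column means the source format changed.
        missing = EXPECTED_COLUMNS - set(row)
        if missing:
            raise ValidationError(
                f"Row {row_number} is missing columns: {sorted(missing)}"
            )

        # Soft failure: an unfamiliar value is flagged for manual review.
        if row["status"] not in KNOWN_STATUSES:
            reason = f"unexpected status {row['status']!r}"
            result.needs_review.append((row_number, row, reason))
            continue

        result.valid.append(row)
    return result
```

A pipeline step could call `validate_rows` right after extraction, pass `result.valid` downstream, and write `result.needs_review` to a file as the break point for manual review. Splitting hard failures from reviewable rows is a design choice for this sketch; some sources may warrant stopping on any surprise.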
Timeline
I expect this to take about a day of focused work.
Cc @fgregg – any resources or examples that come to mind that may be relevant here?