biglocalnews / warn-transformer

Consolidate, enrich and republish the data gathered by warn-scraper
https://warn-transformer.readthedocs.io
Apache License 2.0

Meet Great Expectations (WIP) #252

Open chriszs opened 3 months ago

chriszs commented 3 months ago

This PR is a work-in-progress draft of a potential command to validate raw data using Great Expectations. It creates an expectation suite that checks if each raw CSV has three or more rows and then opens an HTML report listing the results.

This is very much a first effort, and we would probably want to factor it a little differently if we decided to use it.
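For readers unfamiliar with the check, here is a minimal standard-library sketch of the rule the suite enforces. It is not the code in this PR and does not use the Great Expectations API (where a table row-count expectation with a minimum of 3 would play this role). The function names and the choice to treat the first line as a header are assumptions made for illustration.

```python
import csv
from pathlib import Path
from typing import Dict

MIN_ROWS = 3  # threshold described in this PR


def count_data_rows(csv_path: Path) -> int:
    """Count non-empty rows after the first line (assumed to be a header)."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        return sum(1 for row in reader if row)


def validate_raw_dir(raw_dir: Path, min_rows: int = MIN_ROWS) -> Dict[str, bool]:
    """Map each CSV's stem (e.g. "ak") to whether it has at least min_rows rows."""
    return {
        path.stem: count_data_rows(path) >= min_rows
        for path in sorted(raw_dir.glob("*.csv"))
    }
```

Calling the hypothetical `validate_raw_dir` on the raw directory would return one pass/fail flag per state file, which is roughly the information the HTML report presents.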

Usage

The following should validate the CSVs in the default raw directory used by warn-transformer, verifying that each has three or more rows. It writes a data quality report to a temporary directory and opens it in a browser (in production we'd want to persist the report somewhere and/or alert off of it):

pipenv install
python -m warn_transformer.cli validate -l DEBUG
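To illustrate the "report in a temporary directory, opened in a browser" step, here is a rough standard-library sketch. It is a stand-in for the HTML report that Great Expectations generates, not a reproduction of it. The function name, the `warn-validate-` prefix and the table layout are all hypothetical.

```python
import html
import tempfile
import webbrowser
from pathlib import Path
from typing import Dict


def write_report(results: Dict[str, bool], open_browser: bool = False) -> Path:
    """Write a simple pass/fail HTML table to a fresh temporary directory."""
    report_dir = Path(tempfile.mkdtemp(prefix="warn-validate-"))
    report_path = report_dir / "index.html"
    rows = "\n".join(
        f"<tr><td>{html.escape(name)}</td><td>{'passed' if ok else 'failed'}</td></tr>"
        for name, ok in sorted(results.items())
    )
    report_path.write_text(
        f"<table>\n<tr><th>Source</th><th>Result</th></tr>\n{rows}\n</table>\n",
        encoding="utf-8",
    )
    if open_browser:
        # Opens the report in the default browser; off by default for CI use.
        webbrowser.open(report_path.as_uri())
    return report_path
```

Because the directory comes from `tempfile.mkdtemp`, the report is not cleaned up automatically but also is not kept anywhere predictable, which is why persisting it elsewhere would matter in production.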

Screenshots

In this example, I hand-edited ak.csv to fail the check:

[Screenshot: Great Expectations validation results; the ak raw data source has failed while al has succeeded]

Detail on the failure:

[Screenshot: detail on the ak failure, showing which check it failed]

Related to #236

chriszs commented 3 months ago

Great Expectations is apparently only compatible with Python 3.8 and up, so I removed 3.7 from the CI matrix for demonstration purposes.

Also, I believe updating Pipfile.lock when I added GE also upgraded some non-pinned deps. Flake8 is now at 7.0, which has at least one incompatibility with the version currently configured in pre-commit (so we should probably either upgrade the version in pre-commit or pin the version in the Pipfile).

There's a 1.0 version of GE now in pre-release, which seems like it will move stuff around (but isn't well-documented yet), so I locked it to the current point release.