NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

[Experiment] - Implement data quality checks with GX #809

Open sf-dcp opened 2 weeks ago

sf-dcp commented 2 weeks ago

Related to #769.

TLDR: GX is a robust and flexible framework, but it is too complex for simple tasks. IMO, it adds a lot of overhead.

What

Implement data quality checks for the bpl_libraries recipe in the template-db dataset with Great Expectations (aka GX). Checks:

GX fancy vocab

No need to read this, but I'm providing it just in case.

Where

Checks are run in our Postgres db so that geospatial checks are possible. GX integrates with many data sources and can also be used on local files via a pandas or Spark dataframe.
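To illustrate what "running a check in the database" means, here is a minimal sketch that does not use GX at all. It evaluates a not-null check as SQL, with an in-memory SQLite database standing in for Postgres. The function name `count_unexpected_nulls`, the column names, and the sample rows are all made up for illustration and are not taken from the actual bpl_libraries recipe.

```python
import sqlite3


def count_unexpected_nulls(conn, table, column):
    """Return the number of rows in `table` where `column` is NULL."""
    # Identifiers cannot be bound as query parameters, so only pass trusted names.
    query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    return conn.execute(query).fetchone()[0]


# In-memory SQLite is a stand-in for the Postgres db (assumption for this sketch).
conn = sqlite3.connect(":memory:")
# Hypothetical columns; the real table schema is not shown in this issue.
conn.execute("CREATE TABLE bpl_libraries (name TEXT, borough TEXT)")
conn.executemany(
    "INSERT INTO bpl_libraries VALUES (?, ?)",
    [("Library A", "Brooklyn"), ("Library B", None)],  # made-up sample rows
)

null_count = count_unexpected_nulls(conn, "bpl_libraries", "borough")
check_passed = null_count == 0
print(f"null boroughs: {null_count}, passed: {check_passed}")
```

Because the check is a query, only the count leaves the database. That is the same reason geospatial checks are easier to run in Postgres than on exported files.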

How

Many of the files in the gx/ directory are auto-generated. For example, the gx/data_docs/static/ directory contains boilerplate code used in the HTML docs pages.

The actual files I created for deployment:

Running checks from CLI:

great_expectations checkpoint run my_checkpoint
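If the same command needs to run from a script or a CI job, a thin wrapper along these lines might work. This is a sketch that assumes the great_expectations CLI is on PATH and is invoked from the directory containing gx/. The helper names are hypothetical. It also assumes the exit code distinguishes passing from failing runs, which should be verified against the installed GX version.

```python
import subprocess


def build_checkpoint_command(checkpoint_name):
    """Build the argument list for the CLI call shown above."""
    return ["great_expectations", "checkpoint", "run", checkpoint_name]


def run_checkpoint(checkpoint_name):
    """Run the checkpoint through the CLI and return the process exit code."""
    # check=False: return the exit code to the caller instead of raising on failure.
    completed = subprocess.run(build_checkpoint_command(checkpoint_name), check=False)
    return completed.returncode


# Example (not executed here): exit_code = run_checkpoint("my_checkpoint")
```

Passing the arguments as a list avoids shell quoting issues if a checkpoint name ever contains spaces.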


Other GX notes