TLDR: Robust and flexible framework, but too complex for simple tasks. IMO, adds a lot of overhead.
What
Implement data quality checks for bpl_libraries recipe in template-db dataset with Great Expectations (aka GX).
Checks:
title column: can't have duplicate or null values
region column: values should be within a defined set of values
wkb_geometry column: can't have null values and should be in the state projection (custom data check).
GX fancy vocab
No need to read this, but providing it just in case.
Data Context
The entry point to GX; it contains various metadata about the GX project (aka stores).
Can be one of three types: a Python object, a yaml file, or hosted in the cloud.
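For a file-backed project like this one, loading the context is a one-liner (minimal sketch; the exact entry point varies a bit by GX version):

```python
import great_expectations as gx

# Loads the Data Context described by gx/great_expectations.yml,
# including stores, data sources, and plugin locations
context = gx.get_context()
```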
Data Source
Essentially the connection to your data.
A data source can have three types of connectors: runtime, specific, and inferred.
Runtime - no tables are specified in advance.
Specific - tables must be specified in advance.
Inferred - reads in all tables from the provided schema.
The data GX connects to is referred to as “Assets”. Assets are collections of records. For example, an asset can be a SQL table or a folder of csv files.
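For reference, a Postgres datasource entry illustrating the three connector types might look roughly like this (a sketch in the 0.13-era config style; my_postgres_db, the connector names, and POSTGRES_URL are made up):

```yaml
my_postgres_db:
  class_name: Datasource
  execution_engine:
    class_name: SqlAlchemyExecutionEngine
    connection_string: ${POSTGRES_URL}  # substituted from an env variable
  data_connectors:
    # "runtime": tables/queries are handed in at run time
    runtime_connector:
      class_name: RuntimeDataConnector
      batch_identifiers:
        - run_id
    # "specific": assets are listed explicitly
    specific_connector:
      class_name: ConfiguredAssetSqlDataConnector
      assets:
        bpl_libraries: {}
    # "inferred": every table in the schema becomes an asset
    inferred_connector:
      class_name: InferredAssetSqlDataConnector
      include_schema_name: true
```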
Expectations
Data validation checks
Grouped into “expectation suites”, i.e. collections of checks.
There are built-in expectations, and community-generated expectations can be added via plugins. You can also create your own custom expectations.
Batch
A chunk of data read in from the source. For example, you can read in only a slice of the dataset to be validated, or the result of a query.
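As an illustration, a query-backed batch via a runtime connector looks something like this (reusing the hypothetical names from the datasource sketch above):

```python
from great_expectations.core.batch import RuntimeBatchRequest

# A batch defined by a query rather than a whole table
batch_request = RuntimeBatchRequest(
    datasource_name="my_postgres_db",
    data_connector_name="runtime_connector",
    data_asset_name="bpl_libraries",
    runtime_parameters={"query": "SELECT * FROM bpl_libraries LIMIT 1000"},
    batch_identifiers={"run_id": "bpl_libraries_sample"},
)
```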
Validator
An object that pairs one table (or a query result) with its associated data checks.
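Continuing the sketch above, building a validator and attaching checks looks roughly like this (the region values are placeholders):

```python
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="bpl_libraries_checks",
)

# Each expectation runs immediately against the batch and is recorded on the suite
validator.expect_column_values_to_be_unique("title")
validator.expect_column_values_to_not_be_null("title")
validator.expect_column_values_to_be_in_set(
    "region", ["placeholder_region_1", "placeholder_region_2"]
)

# Persist the suite, keeping even the checks that just failed so they can be tweaked
validator.save_expectation_suite(discard_failed_expectations=False)
```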
Checkpoints
Running expectations on the given data
Returns validation results
Checkpoints can be configured to perform custom actions depending on the validation results, like sending a notification.
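Besides the CLI run shown under How below, a checkpoint can be triggered from Python; a minimal sketch:

```python
# Runs every validation configured in the checkpoint's yaml file
result = context.run_checkpoint(checkpoint_name="my_checkpoint")

if not result.success:
    # e.g. fail the pipeline run
    raise RuntimeError("bpl_libraries failed data quality checks")
```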
Where
Checks are performed in our Postgres db to allow for geospatial checks. Among many integrations with various sources, GX can also be used on local files via pandas or Spark dataframes.
How
A lot of files in the gx/ directory are auto-generated. For example, the gx/data_docs/static/ directory contains boilerplate code used in the HTML docs pages.
The actual files I created for deployment:
gx/great_expectations.yml - a config file similar to dbt_project.yml. It contains info on where GX should look for various components such as docs, custom data checks, expectations (aka data quality checks), and so forth. It also contains db connection credentials; I used env variables to define them (see the snippet below).
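GX substitutes ${VAR}-style references from the environment (or from config_variables.yml), which keeps credentials out of the repo. Roughly like this, with made-up variable names:

```yaml
# excerpt of a datasource block in gx/great_expectations.yml
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  connection_string: postgresql+psycopg2://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:5432/${DB_NAME}
```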
gx/plugins/custom_expectations/expect_column_values_to_be_of_state_projection.py - my definition of a custom data check that validates the projection. The way you define custom data checks in GX is by extending existing classes. I don't fully understand how they work under the hood, but I can tell it's cumbersome. A sketch of the pattern follows the docs link below.
Here are docs walking you through defining your custom data checks.
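For a sense of the shape: a column-map check is two classes, a metric provider and the expectation that points at it. A rough sketch of the idea, not the exact contents of my file (the metric name and SRID 2263 are illustrative):

```python
import sqlalchemy as sa

from great_expectations.execution_engine import SqlAlchemyExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)


class ColumnValuesAreInStateProjection(ColumnMapMetricProvider):
    # The name that links this metric to the expectation below
    condition_metric_name = "column_values.state_projection"

    @column_condition_partial(engine=SqlAlchemyExecutionEngine)
    def _sqlalchemy(cls, column, **kwargs):
        # PostGIS: a row passes if its geometry's SRID matches the
        # state plane projection (2263 here is illustrative)
        return sa.func.ST_SRID(column) == 2263


class ExpectColumnValuesToBeOfStateProjection(ColumnMapExpectation):
    """Expect geometry values to be stored in the state projection."""

    map_metric = "column_values.state_projection"
    success_keys = ("mostly",)
```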
gx/expectations/bpl_libraries_checks.json - the actual data checks, aka expectations, for the bpl_libraries recipe. Note that the custom data check defined above is referenced here.
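Abridged, the suite file looks something like this (the region value set is a placeholder):

```json
{
  "expectation_suite_name": "bpl_libraries_checks",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_unique",
      "kwargs": {"column": "title"}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "title"}
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {"column": "region", "value_set": ["placeholder_region"]}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "wkb_geometry"}
    },
    {
      "expectation_type": "expect_column_values_to_be_of_state_projection",
      "kwargs": {"column": "wkb_geometry"}
    }
  ]
}
```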
gx/checkpoints/my_checkpoint.yml - a config file describing how GX should run expectations. For example, you can specify running data checks on multiple datasets at once, or additional actions to take on success or failure, like sending a Slack message or saving output to a storage of your choice. You can have multiple checkpoint files depending on your use case.
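An abridged sketch of what such a checkpoint contains (the datasource/connector names are the hypothetical ones from earlier, and the Slack action illustrates the failure hooks mentioned above):

```yaml
name: my_checkpoint
config_version: 1.0
class_name: Checkpoint
validations:
  - batch_request:
      datasource_name: my_postgres_db
      data_connector_name: inferred_connector
      data_asset_name: bpl_libraries
    expectation_suite_name: bpl_libraries_checks
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
  - name: notify_slack
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK}
      notify_on: failure
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer
```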
Running checks from CLI:
great_expectations checkpoint run my_checkpoint
Other GX notes
I really like the GX docs for data checks, aka expectations: they give examples and a list of all function arguments.
Data checks are not as readable as Soda's.
Data source, expectations, and checkpoint configs (the generated yaml files) can be created interactively via pre-configured Jupyter Notebooks that GX serves from localhost and that guide you through the process of generating them. It seems like people develop and revise data quality checks primarily in Jupyter Notebooks, though you can technically edit the json data check files directly.
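If memory serves, those notebooks are launched by legacy CLI subcommands, roughly:

```
great_expectations datasource new                   # configure a data source in a notebook
great_expectations suite new                        # scaffold an expectation suite
great_expectations suite edit bpl_libraries_checks  # revise a suite in a notebook
great_expectations checkpoint new my_checkpoint     # generate a checkpoint config
```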
A useful feature is data profiling, which automatically generates data check boilerplate that you can then revise.
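In Python this is exposed as the UserConfigurableProfiler, which drafts a suite from a batch; a minimal sketch, reusing the validator from earlier:

```python
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

# Builds a draft expectation suite from whatever is in the batch;
# the generated checks are a starting point to prune and edit
profiler = UserConfigurableProfiler(profile_dataset=validator)
suite = profiler.build_suite()
```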
Some geospatial checks are available through community contributions, but they only integrate with geopandas.
Related to #769.