TLDR: Robust and flexible framework, but too complex for simple tasks. IMO, adds a lot of overhead.
What
Implement data quality checks for bpl_libraries recipe in template-db dataset with Great Expectations (aka GX).
Checks:
title column: can't have duplicate or null values
region column: values should be within a defined set of values
wkb_geometry column: can't have null values and should be in the state projection (custom data check).
GX fancy vocab
No need to read this, but providing it just in case.
Data Context
The entry point to GX; it contains various metadata about the GX project (aka stores).
Can be one of three types: a Python object, a yaml file, or hosted in the cloud.
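For a file-backed project like this one, loading the context is a one-liner (minimal sketch; the exact entry point varies a bit by GX version):

```python
import great_expectations as gx

# Loads the Data Context described by gx/great_expectations.yml,
# including stores, data sources, and plugin locations
context = gx.get_context()
```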
Data Source
Essentially the connection to your data.
A data source can have three types of connectors: runtime, specific, and inferred.
Runtime - no tables are specified in advance.
Specific - tables must be specified in advance.
Inferred - reads in all tables from the provided schema.
The data GX connects to is referred to as “Assets”. Assets are collections of records. For example, an asset can be a SQL table or a folder of csv files.
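For reference, a Postgres datasource entry illustrating the three connector types might look roughly like this (a sketch in the 0.13-era config style; my_postgres_db, the connector names, and POSTGRES_URL are made up):

```yaml
my_postgres_db:
  class_name: Datasource
  execution_engine:
    class_name: SqlAlchemyExecutionEngine
    connection_string: ${POSTGRES_URL}  # substituted from an env variable
  data_connectors:
    # "runtime": tables/queries are handed in at run time
    runtime_connector:
      class_name: RuntimeDataConnector
      batch_identifiers:
        - run_id
    # "specific": assets are listed explicitly
    specific_connector:
      class_name: ConfiguredAssetSqlDataConnector
      assets:
        bpl_libraries: {}
    # "inferred": every table in the schema becomes an asset
    inferred_connector:
      class_name: InferredAssetSqlDataConnector
      include_schema_name: true
```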
Expectations
Data validation checks
Grouped into “expectation suites”, i.e. collections of checks.
There are built-in expectations, and community-generated expectations can be added via plugins. You can also create your own custom expectations.
Batch
A chunk of data read in from the source. For example, you can read in only a slice of the dataset to be validated, or the result of a query.
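As an illustration, a query-backed batch via a runtime connector looks something like this (reusing the hypothetical names from the datasource sketch above):

```python
from great_expectations.core.batch import RuntimeBatchRequest

# A batch defined by a query rather than a whole table
batch_request = RuntimeBatchRequest(
    datasource_name="my_postgres_db",
    data_connector_name="runtime_connector",
    data_asset_name="bpl_libraries",
    runtime_parameters={"query": "SELECT * FROM bpl_libraries LIMIT 1000"},
    batch_identifiers={"run_id": "bpl_libraries_sample"},
)
```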
Validator
An object that pairs one table (or a query result) with its associated data checks.
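Continuing the sketch above, building a validator and attaching checks looks roughly like this (the region values are placeholders):

```python
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="bpl_libraries_checks",
)

# Each expectation runs immediately against the batch and is recorded on the suite
validator.expect_column_values_to_be_unique("title")
validator.expect_column_values_to_not_be_null("title")
validator.expect_column_values_to_be_in_set(
    "region", ["placeholder_region_1", "placeholder_region_2"]
)

# Persist the suite, keeping even the checks that just failed so they can be tweaked
validator.save_expectation_suite(discard_failed_expectations=False)
```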
Checkpoints
Running expectations on the given data
Returns validation results
Checkpoints can be configured to perform custom actions depending on the validation results, like sending a notification.
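Besides the CLI run shown under How below, a checkpoint can be triggered from Python; a minimal sketch:

```python
# Runs every validation configured in the checkpoint's yaml file
result = context.run_checkpoint(checkpoint_name="my_checkpoint")

if not result.success:
    # e.g. fail the pipeline run
    raise RuntimeError("bpl_libraries failed data quality checks")
```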
Where
Checks are performed in our Postgres db to allow for geospatial checks. Among many integrations with various sources, GX can also be used on local files via pandas or Spark dataframes.
How
A lot of files in the gx/ directory are auto-generated. For example, the gx/data_docs/static/ directory contains boilerplate code used in the HTML docs pages.
The actual files I created for deployment:
gx/great_expectations.yml - a config file similar to dbt_project.yml. It contains info on where GX should look for various components such as docs, custom data checks, expectations (aka data quality checks), and so forth. It also contains db connection credentials; I used env variables to define them (see the snippet below).
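GX substitutes ${VAR}-style references from the environment (or from config_variables.yml), which keeps credentials out of the repo. Roughly like this, with made-up variable names:

```yaml
# excerpt of a datasource block in gx/great_expectations.yml
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  connection_string: postgresql+psycopg2://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:5432/${DB_NAME}
```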
gx/plugins/custom_expectations/expect_column_values_to_be_of_state_projection.py - my definition of a custom data check that validates the projection. The way you define custom data checks in GX is by extending existing classes. I don't fully understand how they work under the hood, but I can tell it's cumbersome. A sketch of the pattern follows the docs link below.
Here are docs walking you through defining your custom data checks.
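For a sense of the shape: a column-map check is two classes, a metric provider and the expectation that points at it. A rough sketch of the idea, not the exact contents of my file (the metric name and SRID 2263 are illustrative):

```python
import sqlalchemy as sa

from great_expectations.execution_engine import SqlAlchemyExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)


class ColumnValuesAreInStateProjection(ColumnMapMetricProvider):
    # The name that links this metric to the expectation below
    condition_metric_name = "column_values.state_projection"

    @column_condition_partial(engine=SqlAlchemyExecutionEngine)
    def _sqlalchemy(cls, column, **kwargs):
        # PostGIS: a row passes if its geometry's SRID matches the
        # state plane projection (2263 here is illustrative)
        return sa.func.ST_SRID(column) == 2263


class ExpectColumnValuesToBeOfStateProjection(ColumnMapExpectation):
    """Expect geometry values to be stored in the state projection."""

    map_metric = "column_values.state_projection"
    success_keys = ("mostly",)
```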
gx/expectations/bpl_libraries_checks.json - the actual data checks, aka expectations, for the bpl_libraries recipe. Note that the custom data check defined above is referenced here.
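Abridged, the suite file looks something like this (the region value set is a placeholder):

```json
{
  "expectation_suite_name": "bpl_libraries_checks",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_unique",
      "kwargs": {"column": "title"}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "title"}
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {"column": "region", "value_set": ["placeholder_region"]}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "wkb_geometry"}
    },
    {
      "expectation_type": "expect_column_values_to_be_of_state_projection",
      "kwargs": {"column": "wkb_geometry"}
    }
  ]
}
```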
gx/checkpoints/my_checkpoint.yml - a config file describing how GX should run expectations. For example, you can specify running data checks on multiple datasets at once, or additional actions to take on success or failure, like sending a Slack message or saving output to a storage of your choice. You can have multiple checkpoint files depending on your use case.
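An abridged sketch of what such a checkpoint contains (the datasource/connector names are the hypothetical ones from earlier, and the Slack action illustrates the failure hooks mentioned above):

```yaml
name: my_checkpoint
config_version: 1.0
class_name: Checkpoint
validations:
  - batch_request:
      datasource_name: my_postgres_db
      data_connector_name: inferred_connector
      data_asset_name: bpl_libraries
    expectation_suite_name: bpl_libraries_checks
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
  - name: notify_slack
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK}
      notify_on: failure
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer
```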
Running checks from CLI:
great_expectations checkpoint run my_checkpoint
Other GX notes
I really like the GX docs for data checks, aka expectations: they give examples and a list of all function arguments.
Data checks are not as readable as Soda's.
Data source, expectations, and checkpoint configs (the generated yaml files) can be created interactively via pre-configured Jupyter Notebooks that GX serves from localhost and that guide you through the process of generating them. It seems like people develop and revise data quality checks primarily in Jupyter Notebooks, though you can technically edit the json data check files directly.
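If memory serves, those notebooks are launched by legacy CLI subcommands, roughly:

```
great_expectations datasource new                   # configure a data source in a notebook
great_expectations suite new                        # scaffold an expectation suite
great_expectations suite edit bpl_libraries_checks  # revise a suite in a notebook
great_expectations checkpoint new my_checkpoint     # generate a checkpoint config
```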
A useful feature is data profiling, which automatically generates data check boilerplate that you can then revise.
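In Python this is exposed as the UserConfigurableProfiler, which drafts a suite from a batch; a minimal sketch, reusing the validator from earlier:

```python
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

# Builds a draft expectation suite from whatever is in the batch;
# the generated checks are a starting point to prune and edit
profiler = UserConfigurableProfiler(profile_dataset=validator)
suite = profiler.build_suite()
```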
Some geospatial checks are available through community contributions, but they only integrate with geopandas.
Related to #769.