TLDR: Easy-to-use framework. May not work with geospatial data checks.
Note: this PR has 2 commits. The 1st commit implements Soda, and the write-up below focuses on it; the 2nd commit shows a newer Soda feature that uses data contracts instead.
What
Implement data quality checks for the `bpl_libraries` recipe in the `template-db` dataset with Soda.
Checks:
- `title` column: can't have duplicate or null values
- `region` column: values must come from a defined set of values
- `wkb_geometry` column: can't have null values and must use the state plane projection (custom data check; this geospatial check didn't work, see the notes below)
Where
Checks are performed against our Postgres db. Among its many integrations with various sources, Soda can also be used on local files via a `dask` or `spark` dataframe.
How
This section describes the implementation from the 1st commit.
Soda is very lightweight: it requires only 2 files for deploying data checks:
`soda_sip/configuration.yml` - contains the db connection credentials; I used env variables to define them.
`soda_sip/bpl_libraries_checks.yml` - the actual data checks, aka expectations, for the `bpl_libraries` recipe. I also added more schema check examples that I found useful. Rough sketches of both files are below.
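For reference, a minimal sketch of the configuration file for Postgres, assuming env variables along the lines of what I used (exact key names can vary a bit between Soda Core versions; host/db values are placeholders):

```yaml
data_source template_db:
  type: postgres
  host: ${POSTGRES_HOST}        # resolved from env variables
  port: "5432"
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: ${POSTGRES_DB}
  schema: public
```

And a rough illustration of the SodaCL syntax in the checks file (the real file is in the PR; the region values and SRID below are placeholders, not the real ones):

```yaml
checks for bpl_libraries:
  # title: no duplicates, no nulls
  - duplicate_count(title) = 0
  - missing_count(title) = 0
  # region: values restricted to a defined set (placeholder values)
  - invalid_count(region) = 0:
      valid values: [Brooklyn Heights, Park Slope, Flatbush]
  # wkb_geometry: no nulls
  - missing_count(wkb_geometry) = 0
  # custom free-form SQL check: geometry must use the state plane
  # projection (2263 is the NY / Long Island state plane EPSG code,
  # used here as a placeholder for the real SRID)
  - failed rows:
      name: wkb_geometry has state plane projection
      fail query: |
        SELECT * FROM bpl_libraries
        WHERE ST_SRID(wkb_geometry) != 2263
  # schema check: fail if a required column goes missing
  - schema:
      fail:
        when required column missing: [title, region, wkb_geometry]
```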
Running the checks from the CLI:
`soda scan --data-source template_db --configuration ./configuration.yml bpl_libraries_checks.yml`
An optional argument to the CLI command saves the output to a JSON file, which can be viewed at `soda_sip/output.json`.
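If I recall the flag correctly, it's `--scan-results-file` (`-srf`); worth confirming with `soda scan --help`:

```sh
soda scan --data-source template_db \
  --configuration ./configuration.yml \
  --scan-results-file soda_sip/output.json \
  bpl_libraries_checks.yml
```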
Other Soda notes
I like how easy and quick it is to define data checks, and the ability to create custom data checks with a free-form SQL statement.
I said above that Soda may not work with geospatial checks. In theory it should, because you can write free-form SQL queries that run in your engine of choice. However, I was getting a weird error I couldn't get past.
I reached out to Soda in their Slack channel and they said it should work; they suggested I may be running into this issue.
Implementing Soda with local files via the required `dask` dataframe is a bit annoying because it has to be done in Python, and `dask` uses the Presto SQL dialect under the hood. A sketch of that flow is below.
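For context, here is roughly what that Python flow looks like, going from what I remember of the Soda Core dask API (`soda-core-pandas-dask` package); the CSV path and names are illustrative:

```python
import dask.dataframe as dd
from soda.scan import Scan  # from the soda-core-pandas-dask package

# Hypothetical local export of the recipe (path is illustrative)
ddf = dd.read_csv("bpl_libraries.csv")

scan = Scan()
scan.set_scan_definition_name("bpl_libraries_local")
scan.set_data_source_name("dask")
# Register the dataframe under the dataset name used in the checks file
scan.add_dask_dataframe(dataset_name="bpl_libraries", dask_df=ddf)
scan.add_sodacl_yaml_file("soda_sip/bpl_libraries_checks.yml")
scan.execute()
print(scan.get_logs_text())
```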
Regarding the data contract feature (2nd commit) and how it differs from the main implementation:
- it allows defining a true data contract for a dataset, with custom key-value pairs such as product owner, description, contact info, etc.
- all columns must be defined, each with an optional key that marks whether the column is required (a rough sketch follows this list)
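Roughly what a contract file looks like; data contracts were experimental when I tried them, so treat the exact keys below as illustrative rather than the definitive schema:

```yaml
dataset: bpl_libraries
# custom key-value pairs describing the dataset (placeholder values)
owner: data-engineering
description: Brooklyn Public Library branch locations
columns:
  - name: title
    data_type: text
  - name: region
    data_type: text
  - name: wkb_geometry
    data_type: geometry
    optional: true   # marks a column as not required
```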
Related to #769.