TLDR: Easy-to-use framework. May not work with geospatial data checks.
Note: this PR has 2 commits. The 1st commit implements Soda, and the write-up below focuses on it; the 2nd commit shows a newer Soda feature that uses data contracts instead.
What
Implement data quality checks for the `bpl_libraries` recipe in the `template-db` dataset with Soda.
Checks:
- `title` column: can't have duplicate or null values
- `region` column: values must come from a defined set of values
- `wkb_geometry` column: can't have null values and must use the state plane projection (custom data check; this geospatial check didn't work, see the notes below)
Where
Checks are performed against our Postgres db. Among its many integrations with various sources, Soda can also be used on local files via a `dask` or `spark` dataframe.
How
This section describes the implementation from the 1st commit.
Soda is very lightweight: it requires only 2 files for deploying data checks:
`soda_sip/configuration.yml` - contains the db connection credentials; I used env variables to define them.
`soda_sip/bpl_libraries_checks.yml` - the actual data checks, aka expectations, for the `bpl_libraries` recipe. I also added more schema check examples that I found useful. Rough sketches of both files are below.
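For reference, a minimal sketch of the configuration file for Postgres, assuming env variables along the lines of what I used (exact key names can vary a bit between Soda Core versions; host/db values are placeholders):

```yaml
data_source template_db:
  type: postgres
  host: ${POSTGRES_HOST}        # resolved from env variables
  port: "5432"
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: ${POSTGRES_DB}
  schema: public
```

And a rough illustration of the SodaCL syntax in the checks file (the real file is in the PR; the region values and SRID below are placeholders, not the real ones):

```yaml
checks for bpl_libraries:
  # title: no duplicates, no nulls
  - duplicate_count(title) = 0
  - missing_count(title) = 0
  # region: values restricted to a defined set (placeholder values)
  - invalid_count(region) = 0:
      valid values: [Brooklyn Heights, Park Slope, Flatbush]
  # wkb_geometry: no nulls
  - missing_count(wkb_geometry) = 0
  # custom free-form SQL check: geometry must use the state plane
  # projection (2263 is the NY / Long Island state plane EPSG code,
  # used here as a placeholder for the real SRID)
  - failed rows:
      name: wkb_geometry has state plane projection
      fail query: |
        SELECT * FROM bpl_libraries
        WHERE ST_SRID(wkb_geometry) != 2263
  # schema check: fail if a required column goes missing
  - schema:
      fail:
        when required column missing: [title, region, wkb_geometry]
```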
Running the checks from the CLI:
`soda scan --data-source template_db --configuration ./configuration.yml bpl_libraries_checks.yml`
An optional argument to the CLI command saves the output to a JSON file, which can be viewed at `soda_sip/output.json`.
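If I recall the flag correctly, it's `--scan-results-file` (`-srf`); worth confirming with `soda scan --help`:

```sh
soda scan --data-source template_db \
  --configuration ./configuration.yml \
  --scan-results-file soda_sip/output.json \
  bpl_libraries_checks.yml
```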
Other Soda notes
I like how easy and quick it is to define data checks, and the ability to create custom data checks with a free-form SQL statement.
I said above that Soda may not work with geospatial checks. In theory it should, because you can write free-form SQL queries that run in your engine of choice. However, I was getting a weird error I couldn't get past.
I reached out to Soda in their Slack channel and they said it should work; they suggested I may be running into this issue.
Implementing Soda with local files via the required `dask` dataframe is a bit annoying because it has to be done in Python, and `dask` uses the Presto SQL dialect under the hood. A sketch of that flow is below.
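For context, here is roughly what that Python flow looks like, going from what I remember of the Soda Core dask API (`soda-core-pandas-dask` package); the CSV path and names are illustrative:

```python
import dask.dataframe as dd
from soda.scan import Scan  # from the soda-core-pandas-dask package

# Hypothetical local export of the recipe (path is illustrative)
ddf = dd.read_csv("bpl_libraries.csv")

scan = Scan()
scan.set_scan_definition_name("bpl_libraries_local")
scan.set_data_source_name("dask")
# Register the dataframe under the dataset name used in the checks file
scan.add_dask_dataframe(dataset_name="bpl_libraries", dask_df=ddf)
scan.add_sodacl_yaml_file("soda_sip/bpl_libraries_checks.yml")
scan.execute()
print(scan.get_logs_text())
```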
Regarding the data contract feature (2nd commit) and how it differs from the main implementation:
- it allows defining a true data contract for a dataset, with custom key-value pairs such as product owner, description, contact info, etc.
- all columns must be defined, each with an optional key that marks whether the column is required (a rough sketch follows this list)
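Roughly what a contract file looks like; data contracts were experimental when I tried them, so treat the exact keys below as illustrative rather than the definitive schema:

```yaml
dataset: bpl_libraries
# custom key-value pairs describing the dataset (placeholder values)
owner: data-engineering
description: Brooklyn Public Library branch locations
columns:
  - name: title
    data_type: text
  - name: region
    data_type: text
  - name: wkb_geometry
    data_type: geometry
    optional: true   # marks a column as not required
```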
Related to #769.