NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

[Experiment] Data validation with Soda #818

Open sf-dcp opened 2 weeks ago

sf-dcp commented 2 weeks ago

Related to #769.

TLDR: Easy-to-use framework. May not work with geospatial data checks.

Note: this PR has two commits. The first implements Soda and is the focus of the write-up below; the second demonstrates a newer Soda feature, data contracts, as an alternative.

What

Implement data quality checks with Soda for the bpl_libraries recipe in the template-db dataset. Checks:

Where

Checks run in our Postgres database. Soda integrates with many data sources and can also be used on local files via Dask or Spark DataFrames.

How

This section describes the implementation from the first commit.

Soda is very lightweight: it requires only 2 files for deploying data checks:

Running checks from CLI:

soda scan --data-source template_db --configuration ./configuration.yml bpl_libraries_checks.yml 

image

The CLI command also takes an optional argument that saves the scan output to a JSON file; the result can be viewed at soda_sip/output.json.
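For readers unfamiliar with Soda, here is a rough sketch of what the two files named in the scan command above could contain. It is not copied from the PR: the connection values, environment variable names, column names (id, name), and the specific checks are all assumptions chosen for illustration.

```shell
# Sketch only: every connection value below is a placeholder, not the team's real setup.
# The quoted 'EOF' keeps ${...} literal so Soda, not the shell, resolves the variables.
cat > configuration.yml <<'EOF'
data_source template_db:
  type: postgres
  host: localhost
  port: "5432"
  username: ${POSTGRES_USERNAME}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
EOF

# Hypothetical checks; the column names are assumed, not taken from bpl_libraries.
cat > bpl_libraries_checks.yml <<'EOF'
checks for bpl_libraries:
  - row_count > 0
  - missing_count(name) = 0
  - duplicate_count(id) = 0
EOF
```

With both files in place, the scan command shown above can be run from the same directory.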

Other Soda notes

sf-dcp commented 2 weeks ago

Update: geospatial checks work when you specify the schema where the geospatial functions live (typically public):

image
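As a text version of the idea, a schema-qualified geospatial check might look like the sketch below. It assumes a SodaCL failed rows check, a geometry column named wkb_geometry, and PostGIS installed in the public schema; the file name is hypothetical and none of this is taken from the screenshot.

```shell
# Sketch only: the check name, column name, and file name are assumptions.
# The function is written as public.ST_IsValid so it resolves regardless of
# which schema the scan is configured to use.
cat > bpl_libraries_geo_checks.yml <<'EOF'
checks for bpl_libraries:
  - failed rows:
      name: Geometries must be valid
      fail condition: NOT public.ST_IsValid(wkb_geometry)
EOF
```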

All credit to @damonmcc 🎉