NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Improved Automated QA for Recipes and Products #116

Open alexrichey opened 11 months ago

alexrichey commented 11 months ago

In building FacDB, I ran into a few issues which were trivial to fix, but nonetheless took time to diagnose. They might provide a nice framework to discuss potential improvements.

Here's a sampling of issues I encountered while building FacDB

What I'm thinking for next steps: perhaps model out the required/important fields in FacDB for 1) a subset of recipes and 2) the output to edm-publishing. The output of this exercise would be a declarative description of our expectations (probably fields with types, modeled in yml), which we'd then use to write automations that detect and potentially coerce out-of-spec data into a usable format. Modeling might also be a nice way to indicate which columns actually matter at the periphery. E.g. at ingestion time, if we no longer have the boro field in dsny_electronicsdrop, does that actually matter?
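As a rough sketch of what that could look like (everything here is hypothetical: the `facname` and `zipcode` columns, the `required` flag, and the `check_rows` helper are made up for illustration, and the expectations are a plain Python dict standing in for whatever we'd eventually load from yml):

```python
# Hypothetical expectations for one recipe; in practice this would live in yml.
EXPECTATIONS = {
    "facname": {"type": str, "required": True},
    "zipcode": {"type": int, "required": True},
    "boro": {"type": str, "required": False},  # nice to have, not build-breaking
}


def check_rows(rows, expectations):
    """Return (coerced_rows, problems) for a list of dict rows."""
    coerced_rows, problems = [], []
    for i, row in enumerate(rows):
        out = dict(row)
        for column, spec in expectations.items():
            value = row.get(column)
            if value is None:
                # Only required columns are reported when absent.
                if spec["required"]:
                    problems.append(f"row {i}: missing required column '{column}'")
                continue
            if not isinstance(value, spec["type"]):
                try:
                    out[column] = spec["type"](value)  # attempt coercion
                except (TypeError, ValueError):
                    problems.append(
                        f"row {i}: cannot coerce '{column}'={value!r} "
                        f"to {spec['type'].__name__}"
                    )
        coerced_rows.append(out)
    return coerced_rows, problems


rows = [
    {"facname": "Example Site", "zipcode": "10001"},            # zipcode gets coerced
    {"facname": "Other Site", "zipcode": "n/a", "boro": "BK"},  # zipcode can't be coerced
    {"zipcode": 11201},                                         # facname is missing
]
coerced, problems = check_rows(rows, EXPECTATIONS)
for problem in problems:
    print(problem)
```

The missing `boro` in the first and third rows is not reported, since it is marked as not required; that's the "does that actually matter?" question answered in the model rather than in someone's head.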

There are some neat libraries (e.g. Cerberus) that we might make use of, though I think it would be nice to get our feet wet before making a decision about them.

Thoughts, @fvankrieken, @damonmcc? Would love to hear about your pain points as well. If it'd be easier, we could just huddle and jot down some notes.

damonmcc commented 11 months ago

random thoughts:

definitely like the focus on what columns a product needs aka a product's expectations

is a significant amount of this pain caused by the current data_library approaches? it doesn't save the raw data, and we still expect it to do some transformations, often via its python scripts?

I imagine this could go too far, but I think I like the general approach of something failing during a build rather than investing too much in pre-build checks. Definitely still love declaring expectations though!
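a minimal sketch of the "fail during the build" idea, assuming a made-up `require_columns` helper, a made-up `build_step`, and made-up required columns (`facname`, `address`); none of these exist in the repo:

```python
class ExpectationError(Exception):
    """Raised when a build input is missing columns the product needs."""


def require_columns(dataset_name, columns, required):
    """Fail loudly, naming the dataset and every missing column."""
    missing = sorted(set(required) - set(columns))
    if missing:
        raise ExpectationError(
            f"{dataset_name} is missing required columns: {', '.join(missing)}"
        )


def build_step(dataset_name, columns):
    # Hypothetical build step: check expectations right where the data is used.
    require_columns(dataset_name, columns, required=["facname", "address"])
    return f"built from {dataset_name}"
```

with this shape, `build_step("dsny_electronicsdrop", ["facname", "address"])` goes through even without `boro`, while a dataset lacking `address` stops the build with an error that says exactly which column is gone.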

on the topic of yml files for modeling, here's an example of declaring tests for certain columns in a dbt project. that one only uses a couple of the built-in tests but there's tons of others (dbt_expectations)