NYCPlanning / ae-data-flow

Data pipelines to populate Application Engineering databases

validate source data in an API DB #6

Closed damonmcc closed 3 months ago

damonmcc commented 3 months ago

problem

After we load source data into a database, we want to test some expectations of it (column names, uniqueness, etc.).

thoughts on solutions

It's easier to validate tables in a database than it is to validate the contents of csv files. To validate files, the process is generally:

The Data Engineering team has started using dbt to validate data in databases (both source and transformed) and has found it to be the easiest, most scalable, and most maintainable approach.
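For a rough sense of what validating tables in a database looks like, here's a minimal sketch of the two expectations mentioned above (column names and uniqueness) written as plain SQL checks. This is not dbt. It uses Python's built-in `sqlite3` as a stand-in for the API DB, and the `borough` table, its columns, and the helper names are all made up for illustration.

```python
import sqlite3


def missing_columns(conn, table, expected):
    """Return the expected column names that are absent from the table."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return sorted(set(expected) - actual)


def duplicate_values(conn, table, column):
    """Return the values that appear more than once in the column."""
    query = (
        f"SELECT {column} FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1 ORDER BY {column}"
    )
    return [row[0] for row in conn.execute(query)]


def demo():
    # Hypothetical source table loaded into an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE borough (id TEXT, title TEXT)")
    conn.executemany(
        "INSERT INTO borough VALUES (?, ?)",
        [("1", "Manhattan"), ("2", "Bronx"), ("2", "Brooklyn")],
    )
    return (
        missing_columns(conn, "borough", ["id", "title", "abbr"]),
        duplicate_values(conn, "borough", "id"),
    )
```

Each check is a query that returns the offending rows, so an empty result means the expectation holds. Table and column names are interpolated straight into the SQL here, which is only OK because they are hard-coded in the sketch.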

resources

Here's the intro to dbt in their docs and here's a great article summarizing data testing with dbt

Here are two video introductions to dbt

damonmcc commented 3 months ago

hey @bmarchena, I started working on this and linked to some dbt resources in the issue description above

down to talk through how this all works too but thought it'd be nice to see some background info/docs on it too