NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

DBT ZTL: create flow chart of entities #911

Open fvankrieken opened 1 week ago

fvankrieken commented 1 week ago

Once dbt is set up, as a first step of dbt-ifying the existing sql in the zoning tax lots database, we'll need to figure out what the structure/flow of dbt models will look like.

dbt does not support updates or cyclic references (one table used to build another which is then used to update the first table, stuff along those lines). So much of the code in this repo will need to be restructured.

Simple case - preprocessing

One simpler case will be what happens in preprocessing.sql. This in a sense is what staging dbt tables are for - taking our inputs and getting them "ready" for use in our pipeline. Take dtm for example. Right now, we

1. update dof_dtm in place
2. create a temp table dof_dtm_tmp from it
3. use dof_dtm_tmp to update dof_dtm again

Giving us something like this

flowchart LR
  dof_dtm --1--> dof_dtm 
  dof_dtm --2--> dof_dtm_tmp 
  dof_dtm_tmp --3--> dof_dtm

To turn this into a dbt DAG, we simply move forwards only, rather than updating an existing table. dof_dtm must exist in _sources.yml, and then in models/staging/stg__dof_dtm we have a single query which chains those same steps as CTEs.

Giving us a much cleaner flow, something like this

flowchart LR
  dof_dtm --> cte1
  cte1 --> cte2
  cte2 --> stg__dof_dtm

In the current build of ztl, once preprocessing.sql is run, we lose our original tables. Now with dbt, we'll have the original dof_dtm preserved, plus a table/view stg__dof_dtm which can be used in further models/queries.
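A minimal sketch of what that staging model could look like - the source name, CTE names, and column list below are placeholders, not the actual preprocessing logic:

-- models/staging/stg__dof_dtm.sql (sketch only; names and columns are placeholders)
with source as (
    -- dof_dtm declared in _sources.yml; 'dof' as the source name is an assumption
    select * from {{ source('dof', 'dof_dtm') }}
),

renamed as (
    -- cte1: select/rename columns into our conventions (placeholder columns)
    select
        bbl,
        geom
    from source
),

cleaned as (
    -- cte2: whatever cleanup preprocessing.sql currently does in place
    select distinct * from renamed
)

select * from cleaned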

More complex - build tables

When we're building, we have slightly more complex logic due to all the update statements. This is going to take a bit more work. But essentially, each sql file has some sort of domain, and likely can be simplified to one or two models that end up being joined at the end to make our final table.
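As a purely hypothetical sketch of where that could land (model names, keys, and columns invented for illustration), the final table might end up as a join of a handful of per-domain models:

-- models/product/dcp_zoning_taxlot.sql (hypothetical model names and columns)
with lots as (
    select * from {{ ref('int__lotzoneper') }}
),

overlays as (
    select * from {{ ref('int__commercialoverlay') }}
)

select
    lots.bbl,
    lots.zoningdistrict1,
    overlays.commercialoverlay1
from lots
left join overlays using (bbl)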

A while back, I made a flow chart to represent ztl. This will be useful (you'll need at least a free-tier Lucidchart account to view it)

https://lucid.app/lucidchart/5ddbf6ef-3185-4071-86c4-61277dd9510d/edit?page=0_0&invitationId=inv_6cedd686-76a7-491f-bc8c-cc3746ed38f2#

A new diagram should be made in draw.io that represents a more idealized DAG with no updates. This will likely change as development goes on, but it will be a good starting place and will ensure that we understand the transformations that are occurring before we actually start changing code.

fvankrieken commented 1 week ago

Thought a bit more about what this initial flow chart should look like to be useful for creating dbt models.

As you know, dbt relies on your models being a DAG, only creating tables with select statements and not modifying tables with delete statements, update statements, or DDL (alter table, rename column, add column, etc). ZTL currently uses a lot of these. However, each of these can be replaced with a select statement. Instead of

UPDATE a SET column1 = 'true'

We instead have a new model with the code

SELECT
    'true' as column1,
    column2,
    column3
FROM a
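
Concretely, that select would live in its own model file and point at the upstream model with ref() rather than mutating it in place - the file and model names here are made up:

-- models/intermediate/int__a_updated.sql (hypothetical name)
select
    'true' as column1,
    column2,
    column3
from {{ ref('a') }}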

dbt docs I think are fairly clear as to why this is useful - the order of operations in sql becomes clearer, it's easier to follow the lineage of columns, and after the project is "built", all intermediate tables and views remain, making it easier to debug when things go wrong (as opposed to mutating data as we go, where the intermediate state of the build is lost).

Thinking from this perspective, what would be a good starting point is a flow chart where we model each operation like this as a new node. So to reuse diagrams from above, instead of

flowchart LR
a --> a

We would write

flowchart LR
a --> a'

This doesn't mean that we will blindly turn each of these nodes from this process into models - the next step will be "grouping" nodes. Say we have a run of consecutive update statements that each only touch a (a --> a' --> a'' --> a''').

Looking at this, it seems fairly apparent that we could group those a nodes together and perform all of those update statements in a single query.
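For instance (columns and conditions here are made up), three back-to-back update statements on a could collapse into one model:

-- instead of three UPDATE statements run one after another on a,
-- one model applies all three changes in a single select (hypothetical columns)
select
    bbl,
    coalesce(zonedist1, 'missing') as zonedist1,
    upper(borough) as borough,
    nullif(notes, '') as notes
from {{ ref('a') }}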

Similarly, if we have a bunch of update statements in a row using additional tables, there might be room for simplification. Say we update a three times in a row, where this time each update also pulls in an additional table.

flowchart LR
x --> a
y --> a
a --> a'
d --> a'
a' --> a''
e --> a''
a'' --> a'''
f --> a'''

This is maybe ripe for grouping as well - it could potentially be done with a single query joining a, d, e, and f. But that query could also get long and complicated. This will take looking into the queries a bit - if any of the tables d, e, or f are related, maybe they could be joined earlier. But it really depends on the underlying logic, and will need to be addressed on a case-by-case basis.
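As a sketch of what the "single query" version might look like (join keys and columns invented for illustration):

-- one model replacing the chain a -> a' -> a'' -> a''' (hypothetical keys and columns)
select
    a.id,
    a.column1,
    d.column_from_d,
    e.column_from_e,
    f.column_from_f
from {{ ref('a') }} as a
left join {{ ref('d') }} as d on d.id = a.id
left join {{ ref('e') }} as e on e.id = a.id
left join {{ ref('f') }} as f on f.id = a.id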

Anyways. The first step to me seems to be mapping out the current logic into dbt entities, as I've described here (prime notation is fine). Then it'll take some effort - both using the graph structure and investigating the entities in each of the tables, intermediates, and queries - to figure out where we can group nodes to make queries simpler when we start moving sql to a dbt model framework.

HengJiang0206 commented 1 week ago

flowchart
dcp_commercialoverlay --preprocessing (rename cols)--> dcp_commercialoverlay'
dcp_limitedheight --preprocessing (rename cols)--> dcp_limitedheight'
dcp_specialpurposesubdistricts --preprocessing (rename cols)--> dcp_specialpurposesubdistricts'
dcp_specialpurpose --preprocessing (rename cols)--> dcp_specialpurpose'
dcp_zoningdistricts --preprocessing (rename cols)--> dcp_zoningdistricts'
dcp_zoningmapamendments --preprocessing (rename cols)--> dcp_zoningmapamendments'
dcp_zoningmapindex --preprocessing (rename cols)--> dcp_zoningmapindex'
dof_dtm --preprocessing (rename cols)--> dof_dtm'
dof_dtm' --CTE coalesced--> dof_dtm''
dof_dtm'' --CREATE dof_dtm_tmp --> dof_dtm'''
dof_dtm''' --index--> dof_dtm''''
dof_dtm'''' --CREATE validdtm--> dof_dtm'''''
dof_dtm''''' --index--> dof_dtm''''''
dcp_zoningdistricts' --CREATE validzones--> dcp_zoningdistricts''
dcp_zoningdistricts'' --index--> dcp_zoningdistricts'''
dcp_zoningdistricts''' --> lotzoneper
dof_dtm''''''--> lotzoneper
lotzoneper --CREATE lotzoneper_grouped--> lotzoneper'
lotzoneper' --CTE initial_rankings--> lotzoneper''
lotzoneper'' --CREATE lotzoneperorder_init--> lotzoneper'''
lotzoneper''' --CTE group_column_added--> lotzoneper''''
lotzoneper'''' --CTE reorder_groups --> lotzoneper'''''
lotzoneper'''''--CTE rows_to_reorder --> lotzoneper''''''
zonedist_priority--CREATE and INSERT values-->zonedist_priority'
zonedist_priority'--CTE new_order--> lotzoneper'''''''
lotzoneper''''''--CTE new_order --> lotzoneper'''''''
lotzoneper'''--CREATE lotzoneperorder--> lotzoneper''''''''
lotzoneper'''''''--CREATE lotzoneperorder--> lotzoneper''''''''
dcp_zoning_taxlot--CREATE empty table-->dcp_zoning_taxlot'
dof_dtm''''--INSERT bbl...-->dcp_zoning_taxlot''
dcp_zoning_taxlot'--INSERT bbl-->dcp_zoning_taxlot''
dcp_zoning_taxlot''--SET zoningdistrict (1-4)-->dcp_zoning_taxlot'''
lotzoneper'''''''' --SET zoningdistrict (1-4)--> dcp_zoning_taxlot'''
dcp_commercialoverlay'--CTE commoverlayper--> dcp_commercialoverlay''
dof_dtm''''''--CTE commoverlayper-->dcp_commercialoverlay''
dcp_commercialoverlay''--CREATE commoverlayperorder-->dcp_commercialoverlay'''
dcp_commercialoverlay'''--SET commercialoverlay (1-2)-->dcp_zoning_taxlot''''
dcp_zoning_taxlot'''--SET commercialoverlay (1-2)-->dcp_zoning_taxlot''''
dof_dtm''''''--CREATE specialpurposeper-->dcp_specialpurpose''
dcp_specialpurpose'--CREATE specialpurposeper-->dcp_specialpurpose''
dcp_specialpurpose''--CTE specialpurposeperorder_init-->dcp_specialpurpose'''
dcp_specialpurpose'''--CREATE specialpurposeperorder-->dcp_specialpurpose''''
dcp_specialpurpose''''--SET specialdistrict(1-3)--> dcp_zoning_taxlot'''''
dcp_zoning_taxlot''''--SET specialdistrict(1-3)--> dcp_zoning_taxlot'''''
dof_dtm''''''--CTE limitedheightper-->dcp_limitedheight''
dcp_limitedheight'--CTE limitedheightper-->dcp_limitedheight''
dcp_limitedheight''--CREATE limitedheightperorder-->dcp_limitedheight'''
dcp_limitedheight'''--SET limitedheightdistrict--> dcp_zoning_taxlot''''''
dcp_zoning_taxlot'''''--SET limitedheightdistrict--> dcp_zoning_taxlot''''''
dof_dtm''''''--CREATE zoningmapper-->dcp_zoningmapindex'''
dcp_zoningmapindex'--CTE validindex-->dcp_zoningmapindex''
dcp_zoningmapindex''--CREATE zoningmapper-->dcp_zoningmapindex'''
dcp_zoningmapindex'''--CREATE zoningmapperorder-->dcp_zoningmapindex''''
dcp_zoningmapindex''''--SET zoningmapnumber zoningmapcode--> dcp_zoning_taxlot'''''''
dcp_zoning_taxlot''''''--SET zoningmapnumber zoningmapcode--> dcp_zoning_taxlot'''''''
dcp_zoningmapamendments'--FLAG rezone-->dcp_zoningmapamendments''
dof_dtm''''--FLAG rezone-->dcp_zoningmapamendments''
dcp_zoningmapamendments'--FLAG inwood rezone-->dcp_zoningmapamendments'''
dof_dtm''''--FLAG inwood rezone-->dcp_zoningmapamendments'''
dcp_zoningmapamendments'''--SET notes-->dcp_zoning_taxlot''''''''
dcp_zoning_taxlot'''''''--SET notes-->dcp_zoning_taxlot''''''''
dcp_zoning_taxlot''''''''--PROCESSING parks-->dcp_zoning_taxlot'''''''''
dcp_zoningmapamendments''--SET inzonechange-->dcp_zoning_taxlot''''''''''
dcp_zoning_taxlot'''''''''--SET inzonechange-->dcp_zoning_taxlot''''''''''
dcp_zoning_taxlot''''''''''--DROP duplicate values-->dcp_zoning_taxlot'''''''''''
dcp_zoning_taxlot'''''''''''--CORRECT zoninggaps-->dcp_zoning_taxlot''''''''''''
dcp_zoning_taxlot''''''''''''--DROP invalid records-->dcp_zoning_taxlot'''''''''''''
dcp_zoning_taxlot'''''''''''''-->dcp_zoning_taxlot_export