NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

DBT ZTL: create flow chart of entities #911

Open fvankrieken opened 1 week ago

fvankrieken commented 1 week ago

Once dbt is set up, as a first step of dbt-ifying the existing sql in the zoning tax lots database, we'll need to figure out what the structure/flow of dbt models will look like.

dbt does not support updates or cyclic references (one table used to build another which is then used to update the first table, stuff along those lines). So much of the code in this repo will need to be restructured.

Simple case - preprocessing

One simpler case will be what happens in preprocessing.sql. This in a sense is what staging dbt tables are for - taking our inputs and getting them "ready" for use in our pipeline. Take dtm for example. Right now, we

1. update dof_dtm in place
2. create a temp table dof_dtm_tmp from it
3. use dof_dtm_tmp to update dof_dtm again

Giving us something like this

flowchart LR
  dof_dtm --1--> dof_dtm 
  dof_dtm --2--> dof_dtm_tmp 
  dof_dtm_tmp --3--> dof_dtm

To turn this into a dbt DAG, we simply move forwards only, rather than updating an existing table. dof_dtm must exist in _sources.yml, and then in models/staging/stg__dof_dtm we have a single query which chains those same steps as CTEs.

Giving us a much cleaner flow, something like this

flowchart LR
  dof_dtm --> cte1
  cte1 --> cte2
  cte2 --> stg__dof_dtm

In the current build of ztl, once preprocessing.sql is run, we lose our original tables. Now with dbt, we'll have the original dof_dtm preserved, plus a table/view stg__dof_dtm which can be used in further models/queries.
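A minimal sketch of what that staging model could look like - the source name, CTE names, and column list below are placeholders, not the actual preprocessing logic:

-- models/staging/stg__dof_dtm.sql (sketch only; names and columns are placeholders)
with source as (
    -- dof_dtm declared in _sources.yml; 'dof' as the source name is an assumption
    select * from {{ source('dof', 'dof_dtm') }}
),

renamed as (
    -- cte1: select/rename columns into our conventions (placeholder columns)
    select
        bbl,
        geom
    from source
),

cleaned as (
    -- cte2: whatever cleanup preprocessing.sql currently does in place
    select distinct * from renamed
)

select * from cleaned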

More complex - build tables

When we're building, we have slightly more complex logic due to all the update statements. This is going to take a bit more work. But essentially, each sql file has some sort of domain, and likely can be simplified to one or two models that end up being joined at the end to make our final table.
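As a purely hypothetical sketch of where that could land (model names, keys, and columns invented for illustration), the final table might end up as a join of a handful of per-domain models:

-- models/product/dcp_zoning_taxlot.sql (hypothetical model names and columns)
with lots as (
    select * from {{ ref('int__lotzoneper') }}
),

overlays as (
    select * from {{ ref('int__commercialoverlay') }}
)

select
    lots.bbl,
    lots.zoningdistrict1,
    overlays.commercialoverlay1
from lots
left join overlays using (bbl)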

A while back, I made a flow chart to represent ztl. This will be useful (you'll need at least a free-tier Lucidchart account to view it)

https://lucid.app/lucidchart/5ddbf6ef-3185-4071-86c4-61277dd9510d/edit?page=0_0&invitationId=inv_6cedd686-76a7-491f-bc8c-cc3746ed38f2#

A new diagram should be made in draw.io that represents a more idealized DAG with no updates. This will likely change as development goes on, but it will be a good starting place and will ensure that we understand the transformations that are occurring before we actually start changing code.

fvankrieken commented 1 week ago

Thought a bit more about what this initial flow chart should look like to be useful for creating dbt models.

As you know, dbt relies on your models being a DAG, only creating tables with select statements and not modifying tables with delete statements, update statements, or DDL (alter table, rename column, add column, etc). ZTL currently uses a lot of these. However, each of these can be replaced with a select statement. Instead of

UPDATE a SET column1 = 'true'

We instead have a new model with the code

SELECT
    'true' as column1,
    column2,
    column3
FROM a
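
Concretely, that select would live in its own model file and point at the upstream model with ref() rather than mutating it in place - the file and model names here are made up:

-- models/intermediate/int__a_updated.sql (hypothetical name)
select
    'true' as column1,
    column2,
    column3
from {{ ref('a') }}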

dbt docs I think are fairly clear as to why this is useful - the order of operations in sql becomes clearer, it's easier to follow the lineage of columns, and after the project is "built", all intermediate tables and views remain, making it easier to debug when things go wrong (as opposed to mutating data as we go, where the intermediate state of the build is lost).

Thinking from this perspective, what would be a good starting point is a flow chart where we model each operation like this as a new node. So to reuse diagrams from above, instead of

flowchart LR
a --> a

We would write

flowchart LR
a --> a'

This doesn't mean that we will blindly turn each of these nodes from this process into models - the next step will be "grouping" nodes. Say we have a run of consecutive update statements that each only touch a (a --> a' --> a'' --> a''').

Looking at this, it seems fairly apparent that we could group those a nodes together and perform all of those update statements in a single query.
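For instance (columns and conditions here are made up), three back-to-back update statements on a could collapse into one model:

-- instead of three UPDATE statements run one after another on a,
-- one model applies all three changes in a single select (hypothetical columns)
select
    bbl,
    coalesce(zonedist1, 'missing') as zonedist1,
    upper(borough) as borough,
    nullif(notes, '') as notes
from {{ ref('a') }}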

Similarly, if we have a bunch of update statements in a row using additional tables, there might be room for simplification. Say we update a three times in a row, where this time each update also pulls in an additional table.

flowchart LR
x --> a
y --> a
a --> a'
d --> a'
a' --> a''
e --> a''
a'' --> a'''
f --> a'''

This is maybe ripe for grouping as well - it could potentially be done with a single query joining a, d, e, and f. But that query could also get long and complicated. This will take looking into the queries a bit - if any of the tables d, e, or f are related, maybe they could be joined earlier. But it really depends on the underlying logic, and will need to be addressed on a case-by-case basis.
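As a sketch of what the "single query" version might look like (join keys and columns invented for illustration):

-- one model replacing the chain a -> a' -> a'' -> a''' (hypothetical keys and columns)
select
    a.id,
    a.column1,
    d.column_from_d,
    e.column_from_e,
    f.column_from_f
from {{ ref('a') }} as a
left join {{ ref('d') }} as d on d.id = a.id
left join {{ ref('e') }} as e on e.id = a.id
left join {{ ref('f') }} as f on f.id = a.id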

Anyways. The first step to me seems to be mapping out the current logic into dbt entities, as I've described here (prime notation is fine). Then it'll take some effort - both using the graph structure and investigating the entities in each of the tables, intermediates, and queries - to figure out where we can group nodes to make queries simpler when we start moving sql to a dbt model framework.

HengJiang0206 commented 1 week ago

flowchart
dcp_commercialoverlay --preprocessing (rename cols)--> dcp_commercialoverlay'
dcp_limitedheight --preprocessing (rename cols)--> dcp_limitedheight'
dcp_specialpurposesubdistricts --preprocessing (rename cols)--> dcp_specialpurposesubdistricts'
dcp_specialpurpose --preprocessing (rename cols)--> dcp_specialpurpose'
dcp_zoningdistricts --preprocessing (rename cols)--> dcp_zoningdistricts'
dcp_zoningmapamendments --preprocessing (rename cols)--> dcp_zoningmapamendments'
dcp_zoningmapindex --preprocessing (rename cols)--> dcp_zoningmapindex'
dof_dtm --preprocessing (rename cols)--> dof_dtm'
dof_dtm' --CTE coalesced--> dof_dtm''
dof_dtm'' --CREATE dof_dtm_tmp --> dof_dtm'''
dof_dtm''' --index--> dof_dtm''''
dof_dtm'''' --CREATE validdtm--> dof_dtm'''''
dof_dtm''''' --index--> dof_dtm''''''
dcp_zoningdistricts' --CREATE validzones--> dcp_zoningdistricts''
dcp_zoningdistricts'' --index--> dcp_zoningdistricts'''
dcp_zoningdistricts''' --> lotzoneper
dof_dtm''''''--> lotzoneper
lotzoneper --CREATE lotzoneper_grouped--> lotzoneper'
lotzoneper' --CTE initial_rankings--> lotzoneper''
lotzoneper'' --CREATE lotzoneperorder_init--> lotzoneper'''
lotzoneper''' --CTE group_column_added--> lotzoneper''''
lotzoneper'''' --CTE reorder_groups --> lotzoneper'''''
lotzoneper'''''--CTE rows_to_reorder --> lotzoneper''''''
zonedist_priority--CREATE and INSERT values-->zonedist_priority'
zonedist_priority'--CTE new_order--> lotzoneper'''''''
lotzoneper''''''--CTE new_order --> lotzoneper'''''''
lotzoneper'''--CREATE lotzoneperorder--> lotzoneper''''''''
lotzoneper'''''''--CREATE lotzoneperorder--> lotzoneper''''''''
dcp_zoning_taxlot--CREATE empty table-->dcp_zoning_taxlot'
dof_dtm''''--INSERT bbl...-->dcp_zoning_taxlot''
dcp_zoning_taxlot'--INSERT bbl-->dcp_zoning_taxlot''
dcp_zoning_taxlot''--SET zoningdistrict (1-4)-->dcp_zoning_taxlot'''
lotzoneper'''''''' --SET zoningdistrict (1-4)--> dcp_zoning_taxlot'''
dcp_commercialoverlay'--CTE commoverlayper--> dcp_commercialoverlay''
dof_dtm''''''--CTE commoverlayper-->dcp_commercialoverlay''
dcp_commercialoverlay''--CREATE commoverlayperorder-->dcp_commercialoverlay'''
dcp_commercialoverlay'''--SET commercialoverlay (1-2)-->dcp_zoning_taxlot''''
dcp_zoning_taxlot'''--SET commercialoverlay (1-2)-->dcp_zoning_taxlot''''
dof_dtm''''''--CREATE specialpurposeper-->dcp_specialpurpose''
dcp_specialpurpose'--CREATE specialpurposeper-->dcp_specialpurpose''
dcp_specialpurpose''--CTE specialpurposeperorder_init-->dcp_specialpurpose'''
dcp_specialpurpose'''--CREATE specialpurposeperorder-->dcp_specialpurpose''''
dcp_specialpurpose''''--SET specialdistrict(1-3)--> dcp_zoning_taxlot'''''
dcp_zoning_taxlot''''--SET specialdistrict(1-3)--> dcp_zoning_taxlot'''''
dof_dtm''''''--CTE limitedheightper-->dcp_limitedheight''
dcp_limitedheight'--CTE limitedheightper-->dcp_limitedheight''
dcp_limitedheight''--CREATE limitedheightperorder-->dcp_limitedheight'''
dcp_limitedheight'''--SET limitedheightdistrict--> dcp_zoning_taxlot''''''
dcp_zoning_taxlot'''''--SET limitedheightdistrict--> dcp_zoning_taxlot''''''
dof_dtm''''''--CREATE zoningmapper-->dcp_zoningmapindex'''
dcp_zoningmapindex'--CTE validindex-->dcp_zoningmapindex''
dcp_zoningmapindex''--CREATE zoningmapper-->dcp_zoningmapindex'''
dcp_zoningmapindex'''--CREATE zoningmapperorder-->dcp_zoningmapindex''''
dcp_zoningmapindex''''--SET zoningmapnumber zoningmapcode--> dcp_zoning_taxlot'''''''
dcp_zoning_taxlot''''''--SET zoningmapnumber zoningmapcode--> dcp_zoning_taxlot'''''''
dcp_zoningmapamendments'--FLAG rezone-->dcp_zoningmapamendments''
dof_dtm''''--FLAG rezone-->dcp_zoningmapamendments''
dcp_zoningmapamendments'--FLAG inwood rezone-->dcp_zoningmapamendments'''
dof_dtm''''--FLAG inwood rezone-->dcp_zoningmapamendments'''
dcp_zoningmapamendments'''--SET notes-->dcp_zoning_taxlot''''''''
dcp_zoning_taxlot'''''''--SET notes-->dcp_zoning_taxlot''''''''
dcp_zoning_taxlot''''''''--PROCESSING parks-->dcp_zoning_taxlot'''''''''
dcp_zoningmapamendments''--SET inzonechange-->dcp_zoning_taxlot''''''''''
dcp_zoning_taxlot'''''''''--SET inzonechange-->dcp_zoning_taxlot''''''''''
dcp_zoning_taxlot''''''''''--DROP duplicate values-->dcp_zoning_taxlot'''''''''''
dcp_zoning_taxlot'''''''''''--CORRECT zoninggaps-->dcp_zoning_taxlot''''''''''''
dcp_zoning_taxlot''''''''''''--DROP invalid records-->dcp_zoning_taxlot'''''''''''''
dcp_zoning_taxlot'''''''''''''-->dcp_zoning_taxlot_export