Closed fvankrieken closed 4 months ago
Thought a bit more about what this initial flow chart should look like to be useful for creating dbt models.
As you know, dbt relies on your models forming a DAG: tables are only created with select statements, never modified with delete statements, update statements, or DDL (alter table, rename column, add column, etc.). ZTL currently uses a lot of these. However, each of them can be replaced with a select statement. Instead of
```sql
UPDATE a SET column1 = 'true'
```
we instead have a new model with the code

```sql
SELECT
    'true' as column1,
    column2,
    column3
FROM a
```
dbt docs I think are fairly clear as to why this is useful - order of operations in sql becomes clearer, it's easier to follow lineage of columns, and after the project is "built", all intermediate tables and views remain, making it easier to debug (as opposed to mutating data as we go, where intermediate state of the build is lost) when things go wrong.
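In dbt, that replacement select would live in its own model file, referencing the upstream relation with `ref`. A minimal sketch (the model filename here is invented for illustration):

```sql
-- models/a_with_column1.sql (hypothetical model name)
-- Replaces `UPDATE a SET column1 = 'true'` with a forward-only select.
SELECT
    'true' as column1,
    column2,
    column3
FROM {{ ref('a') }}
```

dbt compiles the `ref()` call to the actual relation name and uses it to wire up the DAG, so lineage falls out of the code itself.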
Thinking from this perspective, a good starting point would be a flow chart where we model each such operation as a new node. To reuse the diagrams from above, instead of
```mermaid
flowchart LR
a --> a
```
We would write
```mermaid
flowchart LR
a --> a'
```
This doesn't mean that we will blindly turn each of these nodes from this process into models - the next step will be "grouping" nodes. Say we have `x` and `y` joined to make `a`, which then goes through a chain of updates:

```mermaid
flowchart LR
x --> a
y --> a
a --> a'
a' --> a''
a'' --> a'''
a''' --> a''''
a'''' --> c
b --> c
```
Looking at this, it seems fairly apparent that we could group these `a` nodes together and perform all of these update statements in a single model.
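As a sketch of what that grouping could look like (all column names and expressions here are made up, not taken from the actual ZTL code), the join of `x` and `y` plus the four subsequent updates might collapse into one select:

```sql
-- Hypothetical single model replacing a, a', a'', a''', a''''
SELECT
    x.id,
    upper(x.name)          as name,      -- was update #1
    coalesce(x.code, '00') as code,      -- was update #2
    y.category             as category,  -- was update #3
    true                   as is_valid   -- was update #4
FROM x
JOIN y ON y.id = x.id
```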
Similarly, if we have a run of update statements where each one pulls from an additional table, there might be room for simplification. Say we update `a` three times in a row, where this time each update draws on another table:
```mermaid
flowchart LR
x --> a
y --> a
a --> a'
d --> a'
a' --> a''
e --> a''
a'' --> a'''
f --> a'''
```
This may be ripe for grouping as well - it could perhaps be done with a single query joining a, d, e, and f, though that query could also get long and complicated. This will take looking into the queries a bit: if any of the tables d, e, or f are related, maybe they could be joined earlier. It really depends on the underlying logic, and will need to be addressed on a case-by-case basis.
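If the grouping does make sense, the combined query might look something like this sketch (join keys and column names are assumptions, not taken from the actual code):

```sql
-- Hypothetical single model replacing the a -> a''' chain
SELECT
    a.*,
    d.some_col as col_from_d,
    e.some_col as col_from_e,
    f.some_col as col_from_f
FROM a
LEFT JOIN d ON d.id = a.id
LEFT JOIN e ON e.id = a.id
LEFT JOIN f ON f.id = a.id
```

Whether left joins are the right choice here depends on whether each original update touched every row of `a` or only matching rows - exactly the kind of case-by-case investigation described above.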
Anyways. The first step to me seems to be mapping out the current logic into dbt entities as I've described here (prime notation is fine). Then it will take some effort, both using the graph structure and investigating the entities in each of the tables, intermediates, and queries, to see where we can group nodes and make queries simpler when we start moving sql to a dbt model framework.
```mermaid
flowchart
dcp_commercialoverlay --preprocessing (rename cols)--> dcp_commercialoverlay'
dcp_limitedheight --preprocessing (rename cols)--> dcp_limitedheight'
dcp_specialpurposesubdistricts --preprocessing (rename cols)--> dcp_specialpurposesubdistricts'
dcp_specialpurpose --preprocessing (rename cols)--> dcp_specialpurpose'
dcp_zoningdistricts --preprocessing (rename cols)--> dcp_zoningdistricts'
dcp_zoningmapamendments --preprocessing (rename cols)--> dcp_zoningmapamendments'
dcp_zoningmapindex --preprocessing (rename cols)--> dcp_zoningmapindex'
dof_dtm --preprocessing (rename cols)--> dof_dtm'
dof_dtm' --CTE coalesced--> dof_dtm''
dof_dtm'' --CREATE dof_dtm_tmp --> dof_dtm'''
dof_dtm''' --index--> dof_dtm''''
dof_dtm'''' --CREATE validdtm--> dof_dtm'''''
dof_dtm''''' --index--> dof_dtm''''''
dcp_zoningdistricts' --CREATE validzones--> dcp_zoningdistricts''
dcp_zoningdistricts'' --index--> dcp_zoningdistricts'''
dcp_zoningdistricts''' --> lotzoneper
dof_dtm''''''--> lotzoneper
lotzoneper --CREATE lotzoneper_grouped--> lotzoneper'
lotzoneper' --CTE initial_rankings--> lotzoneper''
lotzoneper'' --CREATE lotzoneperorder_init--> lotzoneper'''
lotzoneper''' --CTE group_column_added--> lotzoneper''''
lotzoneper'''' --CTE reorder_groups --> lotzoneper'''''
lotzoneper'''''--CTE rows_to_reorder --> lotzoneper''''''
zonedist_priority--CREATE and INSERT values-->zonedist_priority'
zonedist_priority'--CTE new_order--> lotzoneper'''''''
lotzoneper''''''--CTE new_order --> lotzoneper'''''''
lotzoneper'''--CREATE lotzoneperorder--> lotzoneper''''''''
lotzoneper'''''''--CREATE lotzoneperorder--> lotzoneper''''''''
dcp_zoning_taxlot--CREATE empty table-->dcp_zoning_taxlot'
dof_dtm''''--INSERT bbl...-->dcp_zoning_taxlot''
dcp_zoning_taxlot'--INSERT bbl-->dcp_zoning_taxlot''
dcp_zoning_taxlot''--SET zoningdistrict (1-4)-->dcp_zoning_taxlot'''
lotzoneper'''''''' --SET zoningdistrict (1-4)--> dcp_zoning_taxlot'''
dcp_commercialoverlay'--CTE commoverlayper--> dcp_commercialoverlay''
dof_dtm''''''--CTE commoverlayper-->dcp_commercialoverlay''
dcp_commercialoverlay''--CREATE commoverlayperorder-->dcp_commercialoverlay'''
dcp_commercialoverlay'''--SET commercialoverlay (1-2)-->dcp_zoning_taxlot''''
dcp_zoning_taxlot'''--SET commercialoverlay (1-2)-->dcp_zoning_taxlot''''
dof_dtm''''''--CREATE specialpurposeper-->dcp_specialpurpose''
dcp_specialpurpose'--CREATE specialpurposeper-->dcp_specialpurpose''
dcp_specialpurpose''--CTE specialpurposeperorder_init-->dcp_specialpurpose'''
dcp_specialpurpose'''--CREATE specialpurposeperorder-->dcp_specialpurpose''''
dcp_specialpurpose''''--SET specialdistrict(1-3)--> dcp_zoning_taxlot'''''
dcp_zoning_taxlot''''--SET specialdistrict(1-3)--> dcp_zoning_taxlot'''''
dof_dtm''''''--CTE limitedheightper-->dcp_limitedheight''
dcp_limitedheight'--CTE limitedheightper-->dcp_limitedheight''
dcp_limitedheight''--CREATE limitedheightperorder-->dcp_limitedheight'''
dcp_limitedheight'''--SET limitedheightdistrict--> dcp_zoning_taxlot''''''
dcp_zoning_taxlot'''''--SET limitedheightdistrict--> dcp_zoning_taxlot''''''
dof_dtm''''''--CREATE zoningmapper-->dcp_zoningmapindex'''
dcp_zoningmapindex'--CTE validindex-->dcp_zoningmapindex''
dcp_zoningmapindex''--CREATE zoningmapper-->dcp_zoningmapindex'''
dcp_zoningmapindex'''--CREATE zoningmapperorder-->dcp_zoningmapindex''''
dcp_zoningmapindex''''--SET zoningmapnumber zoningmapcode--> dcp_zoning_taxlot'''''''
dcp_zoning_taxlot''''''--SET zoningmapnumber zoningmapcode--> dcp_zoning_taxlot'''''''
dcp_zoningmapamendments'--FLAG rezone-->dcp_zoningmapamendments''
dof_dtm''''--FLAG rezone-->dcp_zoningmapamendments''
dcp_zoningmapamendments'--FLAG inwood rezone-->dcp_zoningmapamendments'''
dof_dtm''''--FLAG inwood rezone-->dcp_zoningmapamendments'''
dcp_zoningmapamendments'''--SET notes-->dcp_zoning_taxlot''''''''
dcp_zoning_taxlot'''''''--SET notes-->dcp_zoning_taxlot''''''''
dcp_zoning_taxlot''''''''--PROCESSING parks-->dcp_zoning_taxlot'''''''''
dcp_zoningmapamendments''--SET inzonechange-->dcp_zoning_taxlot''''''''''
dcp_zoning_taxlot'''''''''--SET inzonechange-->dcp_zoning_taxlot''''''''''
dcp_zoning_taxlot''''''''''--DROP duplicate values-->dcp_zoning_taxlot'''''''''''
dcp_zoning_taxlot'''''''''''--CORRECT zoninggaps-->dcp_zoning_taxlot''''''''''''
dcp_zoning_taxlot''''''''''''--DROP invalid records-->dcp_zoning_taxlot'''''''''''''
dcp_zoning_taxlot'''''''''''''---->dcp_zoning_taxlot_export
```
Once dbt is set up, as a first step of dbt-ifying the existing sql in the zoning tax lots database, we'll need to figure out what the structure/flow of dbt models will look like.
dbt does not support updates or cyclic references (one table used to build another, which is then used to update the first table, and so on), so much of the code in this repo will need to be restructured.
## Simple case - preprocessing
One simpler case will be what happens in `preprocessing.sql`. This, in a sense, is what staging dbt tables are for: taking our inputs and getting them "ready" for use in our pipeline. Take dtm for example. Right now, in `preprocessing.sql`, we rename columns on `dof_dtm` in place.
To turn this into a dbt DAG, we simply move forwards only, rather than updating an existing table: `dof_dtm` must exist in `_sources.yml`, and then in `models/staging/stg__dof_dtm` we have a single query selecting from that source, giving us a much cleaner flow.
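As a rough sketch (the source name and column names below are placeholders, not the real schema), the staging model could look like:

```sql
-- models/staging/stg__dof_dtm.sql
-- dof_dtm itself is declared once in _sources.yml
SELECT
    bbl,
    geom as geometry  -- the kind of column rename preprocessing.sql does today
FROM {{ source('recipe_sources', 'dof_dtm') }}
```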
In the current build of ztl, once `preprocessing.sql` is run, we lose our original tables. With dbt, we'll have both the original `dof_dtm` preserved and a table/view `stg__dof_dtm` which can be used in further models/queries.

## More complex - build tables
When we're building, the logic is slightly more complex due to all the update statements, and this is going to take a bit more work. But essentially, each sql file has some sort of domain, and likely can be simplified to one or two models that end up being joined at the end to make our final table.
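For example (all model names below are invented for illustration), a per-domain model for commercial overlays and one for special districts might be joined into the final table roughly like:

```sql
-- models/dcp_zoning_taxlot.sql (sketch only)
SELECT
    lots.bbl,
    co.commercialoverlay1,
    sd.specialdistrict1
FROM {{ ref('stg__dof_dtm') }} AS lots
LEFT JOIN {{ ref('commercial_overlays') }} AS co USING (bbl)
LEFT JOIN {{ ref('special_districts') }} AS sd USING (bbl)
```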
A while back, I made a flow chart to represent ztl, which will be useful here (you'll need to make at least a free-tier account to view):
https://lucid.app/lucidchart/5ddbf6ef-3185-4071-86c4-61277dd9510d/edit?page=0_0&invitationId=inv_6cedd686-76a7-491f-bc8c-cc3746ed38f2#
A new diagram should be made in draw.io that represents a more idealized DAG with no updates. This will likely change as development goes on, but it will be a good starting place and will ensure that we understand the transformations that are occurring before we actually start changing code.