NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team
22 stars 0 forks source link

DBT ZTL: Simplify DAG for first-draft implementation #924

Closed fvankrieken closed 4 months ago

fvankrieken commented 4 months ago

Follow up to #911 . After initial flow chart/DAG is made, we should attempt to simplify it, grouping steps as needed to create a simple, relatively atomic DAG to create ZTL. #911 has some brief description of what this might look like, but it's very dependent on what the initial DAG ends up looking like, so this description will remain a bit of a stub at the moment

This should also include deciding on initial "grouping" of models. stg, int, product as well as any "sub-groupings" (like int_flags in gft). GFT is a good example but also has a lot of assumed knowledge there. This wiki page has good guidance (as well as links to dbt's official guidance)

HengJiang0206 commented 4 months ago
flowchart 
dcp_commercialoverlay--preprocessing--> stg__dcp_commercialoverlay
dcp_limitedheight --preprocessing --> stg__dcp_limitedheight
dof_dtm--preprocessing------->stg__dof_dtm
stg__dof_dtm---->int__dof_dtm_1
dcp_specialpurpose --preprocessing --> stg__dcp_specialpurpose
dcp_zoningdistricts --preprocessing --> stg__dcp_zoningdistricts
dcp_zoningmapamendments --preprocessing --> stg__dcp_zoningmapamendments
dcp_zoningmapindex --preprocessing --> stg__dcp_zoningmapindex
stg__dof_dtm---->int__dcp_zoningmapamendments_1
stg__dcp_zoningmapamendments---->int__dcp_zoningmapamendments_1
stg__dof_dtm---->int__dcp_zoningmapamendments_2
stg__dcp_zoningmapamendments--->int__dcp_zoningmapamendments_2
int__dof_dtm_1---->int__dcp_zoningmapindex_1
stg__dcp_zoningmapindex---->int__dcp_zoningmapindex_1
int__dcp_zoningmapindex_1---->int__dcp_zoningmapindex_2
int__dof_dtm_1---->int__dcp_limitedheight_1
stg__dcp_limitedheight---->int__dcp_limitedheight_1
int__dcp_limitedheight_1---->int__dcp_limitedheight_2
int__dof_dtm_1---->int__dcp_specialpurpose_1
stg__dcp_specialpurpose---->int__dcp_specialpurpose_1
int__dcp_specialpurpose_1---->int__dcp_specialpurpose_2
int__dof_dtm_1---->int__dcp_commercialoverlay_1
stg__dcp_commercialoverlay---->int__dcp_commercialoverlay_1
int__dcp_commercialoverlay_1---->int__dcp_commercialoverlay_2
dcp_zoning_taxlot---->stg__dcp_zoning_taxlot
stg__dcp_zoning_taxlot---->int__dcp_zoning_taxlot_1
stg__dof_dtm---->int__dcp_zoning_taxlot_1
stg__dcp_zoningdistricts---->int__dcp_zoningdistricts_1
int__dof_dtm_1---->int__dcp_zoningdistricts_1
int__dcp_zoningdistricts_1---->int__dcp_zoningdistricts_2
int__dcp_zoningdistricts_2---->int__dcp_zoningdistricts_3
seed_zone_dist_priority---->int__dcp_zoningdistricts_4
int__dcp_zoningdistricts_3---->int__dcp_zoningdistricts_4
int__dcp_zoningdistricts_4---->int__dcp_zoningdistricts_5
int__dcp_zoningdistricts_2---->int__dcp_zoningdistricts_5
int__dcp_zoningdistricts_5---->int__dcp_zoning_taxlot_2
int__dcp_zoning_taxlot_1---->int__dcp_zoning_taxlot_2
int__dcp_zoning_taxlot_2----> int__dcp_zoning_taxlot_3
int__dcp_commercialoverlay_2----> int__dcp_zoning_taxlot_3
int__dcp_zoning_taxlot_3----> int__dcp_zoning_taxlot_4
int__dcp_specialpurpose_2---->int__dcp_zoning_taxlot_4
int__dcp_zoning_taxlot_4---->int__dcp_zoning_taxlot_5
int__dcp_limitedheight_2---->int__dcp_zoning_taxlot_5
int__dcp_zoning_taxlot_5---->int__dcp_zoning_taxlot_6
int__dcp_zoningmapindex_2---->int__dcp_zoning_taxlot_6
int__dcp_zoning_taxlot_6---->int__dcp_zoning_taxlot_7
int__dcp_zoningmapamendments_2---->int__dcp_zoning_taxlot_7
int__dcp_zoning_taxlot_7---->int__dcp_zoning_taxlot_8
int__dcp_zoning_taxlot_8---->int__dcp_zoning_taxlot_9
int__dcp_zoningmapamendments_1---->int__dcp_zoning_taxlot_9
int__dcp_zoning_taxlot_9---->dcp_zoning_taxlot_export
fvankrieken commented 4 months ago

This is a great first effort!

So we'll chat about this later today and probably also Monday, but I think when it comes to dcp_zoning_taxlot we're going to approach it very differently. This is certainly an idealized version (from the DAG's perspective) of what this could look like, but I think we'll want to aim for something along these lines.

flowchart LR
stg__dof_dtm---->int__zoningmapamendments
stg__dcp_zoningmapamendments---->int__zoningmapamendments
stg__dof_dtm---->int__zoningmapindex
stg__dcp_zoningmapindex---->int__zoningmapindex
stg__dof_dtm---->int__dcp_limitedheight
stg__dcp_limitedheight---->int__limitedheight
stg__dof_dtm---->int__specialpurpose
stg__dcp_specialpurpose---->int__specialpurpose
stg__dof_dtm---->int__commercialoverlay
stg__dcp_commercialoverlay---->int__commercialoverlay
stg__dof_dtm---->int__zoning_taxlot
stg__dcp_zoningdistricts---->int__zoningdistricts
stg__dof_dtm---->int__zoningdistricts
seed_zone_dist_priority---->int__zoningdistricts
stg__dof_dtm---->dcp_zoning_taxlot
int__zoningdistricts---->dcp_zoning_taxlot
int__commercialoverlay----> dcp_zoning_taxlot
int__specialpurpose---->dcp_zoning_taxlot
int__limitedheight---->dcp_zoning_taxlot
int__zoningmapindex---->dcp_zoning_taxlot
int__zoningmapamendments---->dcp_zoning_taxlot
int__zoningmapamendments---->dcp_zoning_taxlot

This is obviously a bit idealized. I think what will make sense moving forward will be taking each of these "int" tables and trying to implement them one at a time. For each, see if it can truly be a single model, or if we will need multiple intermediates like in both the current code and in your flowchart - it's quite possible. And then, once each of them are implemented, work on creating the final table.

So for #925, process will look like

fvankrieken commented 4 months ago

And I'm not providing this flowchart as a "right answer" to the issue here - it really is an idealized version that's not really taking into account the complexity in the queries (and optimizing for performance, etc). In #911, you made something that represented reality, and in this issue, you've made something trying to simplify what actually exists, whereas what I made is something that's a bit more naive. I think these probably have us in a good spot of trying to implement, and we'll likely end up somewhere in between