NYCPlanning / db-knownprojects

KPDB: A compilation of prospective residential development projects from various sources, with rough projections of new unit counts
https://nycplanning.github.io/db-knownprojects

Future Enhancement: Deprecate `db-knownprojects-data` #405

Open mbh329 opened 1 year ago

mbh329 commented 1 year ago

This data repository should be deprecated in favor of using db-data-library. I think one of the reasons the data repo was created originally was that Data Library wasn't fully functional (an assumption), and the data was meant to be kept private because it contains sensitive information not available to the public.

Generally, all the data repo does is process CSV files, XLSX files, or geospatial files into SQL files. Data Library can do this and can also apply some of the standards we use when we archive input/source data.
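To make the "process files into SQL" step concrete, here is a minimal sketch of the CSV case using only the Python standard library. It is not how the data repo or Data Library actually does it; the function names `csv_to_sql` and `quote_literal` are hypothetical, every column is typed as `text`, and column names are assumed not to contain double quotes.

```python
import csv
import io


def quote_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def csv_to_sql(csv_text: str, table: str) -> str:
    """Render CSV text as a CREATE TABLE statement plus one INSERT per row.

    Hypothetical helper for illustration: all columns are created as text,
    and identifiers are assumed to contain no double quotes.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row holds the column names
    columns = ", ".join(f'"{name}" text' for name in header)
    statements = [f'CREATE TABLE "{table}" ({columns});']
    for row in reader:
        values = ", ".join(quote_literal(v) for v in row)
        statements.append(f'INSERT INTO "{table}" VALUES ({values});')
    return "\n".join(statements)


# Made-up sample row, only to show the shape of the output.
print(csv_to_sql("uid,project_na\n1,O'Neill Study\n", "dcp_n_study"))
```

XLSX and geospatial inputs would need additional libraries, which is part of the case for letting Data Library handle the conversion in one place.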

  1. Create data templates for the source data we receive from SL in DCP Housing. This will allow us to archive source data, so we no longer have to keep it in the raw folder, where we constantly change the dates of things and accumulate multiple files with similar data.
    • You can set the access level directly in a Data Library YAML template (e.g. ACL - Private). These should all be set to private.
    • I think there is room to work with SL and data providers to standardize the inputs we receive from them. For one, we can send them a "standard" data template that includes all the applicable data we need for the updated KPDB build pipeline. For example:
dcp_n_study

uid, project_na, source, project_id, total_unit, counted_un, within5, 5to10, after10, geometry

hpd_rfp

Request for Proposals Name, BoroCode, Block, Lot, BBL, agency, est_units, designated, closed, est_closing, closed_date, likely to be built by 2025?, NOTE

The above lists the columns we receive for dcp_n_study and hpd_rfp, two of our source datasets. In the current pipeline, we create a function that brings in the borough, but we could bring this information in earlier from the HPD data, and likewise for other source data inputs. I think that if we standardize some of the source data inputs, then we can get higher accuracy in our final SCA aggregate tables.
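One way a "standard" template could be enforced on receipt is a simple header check. The sketch below is only an illustration: `check_columns` and `DCP_N_STUDY_COLUMNS` are hypothetical names, and the expected columns are copied from the dcp_n_study listing above rather than from any agreed template.

```python
# Expected columns, copied from the dcp_n_study listing in this issue.
# Treating this list as the agreed "standard" template is an assumption.
DCP_N_STUDY_COLUMNS = [
    "uid", "project_na", "source", "project_id", "total_unit",
    "counted_un", "within5", "5to10", "after10", "geometry",
]


def check_columns(received, expected):
    """Return (missing, unexpected) column names, preserving input order."""
    missing = [col for col in expected if col not in received]
    unexpected = [col for col in received if col not in expected]
    return missing, unexpected


# Made-up header from an incoming file, for illustration only.
missing, unexpected = check_columns(
    ["uid", "project_na", "borough"], DCP_N_STUDY_COLUMNS
)
print("missing:", missing)
print("unexpected:", unexpected)
```

Running a check like this before archiving would surface schema drift at the point of receipt instead of partway through the build.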