NYCPlanning / db-developments

🏠 🏘️ 🏗️ Developments Database
https://nycplanning.github.io/db-developments
8 stars 2 forks source link

Ingest DOB NOW Permits and Update Status Q Logics #501

Closed td928 closed 2 years ago

td928 commented 2 years ago

Overview

address #500 which was first spotted when housing was looking status q and found that DOB NOW portion of the HDB is missing their status q field, which reflects earliest permit date associated with a project. After conversations with DOB, it was determined that a new dataset needs to be incorporated and it was delivered through the same DOB FTP process to us. A new workflow similar to DOB NOW applications is created in data library and is ingested into DevDB.

dataloading

dataloading for the new dob now permits is simple enough by reading the datasets from edm-recipes using the import_public function.

_status_q.sql

the status q ingesting and cleaning were done in the _status_q.sql. With new DOB NOW permits file, ~180,000 records were added to the STATUS_Q_devdb table. With the raw file, ~400,000 records were filtered down with two logics 1) only I1 (stands for "initial" according to Sam) records were included and 2) only the earliest permit for each job_filing_number is included because that is what the status q concerns about.

Another major change for this workflow involves the removing of the joining to the init_devdb table to create the STATUS_Q_devdb table in this .sql file. A more detailed rationale for this removal will be explained in the section below.

_mid.sql

Even though no change is made to the _mid.sql in this PR, but it is actually crucial to understanding the status q process. Because the STATUS_Q_devdb, and thereof the earliest permit date, is merged onto the main devdb table in the _mid.sql. The merge is a LEFT JOIN based on job_number. One issue, however, is that while the permit records themselves are unique due to the filtering logics mentioned above, thejob_number in init_devdb is not unique at this stage of the table. This leads the two subsequent join clause in _status_q.sql and then _mid.sql will multiply the table to be extremely large and often crash the program. I made the change to remove the first join with init_devdb completely because it is redundant in my opinion to do the functionally same join twice.

SashaWeinstein commented 2 years ago

I built on my local and then grouped jobs by permit year. STATUS_Q_devdb had 35526 jobs with permit year of 2022, but FINAL_devdb only has 2422. On the other hand, STATUS_Q_devdb has 5708 jobs with permit year of 2011, and FINAL_devdb has 5705. (I chose these two years at random). Te can you help me understand where this dropoff comes from?

SashaWeinstein commented 2 years ago

To check that the creation of the permit_year is correct I ran this

with 
recent_permits as (SELECT * from dob_permitissuance WHERE issuancedate::date > DATE '2000-01-01' ),
recent_dob_now_permits as 
(SELECT * from dob_now_permits WHERE issued_date::date > DATE '2000-01-01'),
STATUS_Q_create as (
    SELECT 
        jobnum as job_number, 
        min(issuancedate::date) as date_permittd
    FROM recent_permits
    WHERE jobdocnum = '01'
    AND jobtype ~* 'A1|DM|NB'
    GROUP BY jobnum
    UNION
    SELECT 
        left(job_filing_number, strpos(job_filing_number, '-') - 1)::text as job_number,
        min(issued_date::date) as date_permittd 
    FROM recent_dob_now_permits
    WHERE right(job_filing_number,2)='I1'
    GROUP BY left(job_filing_number, strpos(job_filing_number, '-') - 1)::text
),
STATUS_Q_devdb_recent as (
SELECT 
    job_number,
    date_permittd,
    -- year_permit
    extract(year from date_permittd)::text as permit_year,
    -- quarter_permit
    year_quarter(date_permittd) as permit_qrtr
    FROM status_Q_create
)
SELECT permit_year, count(*) from STATUS_Q_devdb_recent group by permit_year order by permit_year

Which shows the earliest permit year as 2000

SashaWeinstein commented 2 years ago

Ran this query to see that creation of permit_qrtr is correct. Filtering for march records means all records are assigned Q1 which seems correct

with 
march_permits as (SELECT * from dob_permitissuance WHERE EXTRACT(Month from issuancedate::date) = 3 ),
march_dob_now_permits as 
(SELECT * from dob_now_permits WHERE EXTRACT(Month from issued_date::date) =3),
STATUS_Q_create as (
    SELECT 
        jobnum as job_number, 
        min(issuancedate::date) as date_permittd
    FROM march_permits
    WHERE jobdocnum = '01'
    AND jobtype ~* 'A1|DM|NB'
    GROUP BY jobnum
    UNION
    SELECT 
        left(job_filing_number, strpos(job_filing_number, '-') - 1)::text as job_number,
        min(issued_date::date) as date_permittd 
    FROM march_dob_now_permits
    WHERE right(job_filing_number,2)='I1'
    GROUP BY left(job_filing_number, strpos(job_filing_number, '-') - 1)::text
),
STATUS_Q_devdb_recent as (
SELECT 
    job_number,
    date_permittd,
    -- year_permit
    extract(year from date_permittd)::text as permit_year,
    -- quarter_permit
    year_quarter(date_permittd) as permit_qrtr
    FROM status_Q_create
)
SELECT permit_qrtr, count(*) from STATUS_Q_devdb_recent group by permit_qrtr order by permit_qrtr
SashaWeinstein commented 2 years ago

Will be ready to approve once Te and I talk about the dropoff of 2022 records from STATUS_Q_devdb to final_devdb

mbh329 commented 2 years ago

Everything ran successfully on my machine and went through and looked at some of the outputs. I ran a count of classa_net records by year and got a range of 3000 - 7400 units created per year with ~30000 units not ever being assigned a year but this seems to come from withdrawn permits or stalled permits.