NYCPlanning / db-developments

🏠 🏘️ 🏗️ Developments Database
https://nycplanning.github.io/db-developments
8 stars 2 forks source link

20Q2 - Improve removal of duplicate jobs #31

Closed AmandaDoyle closed 4 years ago

AmandaDoyle commented 4 years ago

Translated logic from comments below:

WITH
DUPLICATES_equal_units AS (
    SELECT 
        a.job_number, 
        b.job_number as duplicate_job_number,
        'duplicate with same units_net' as reason
    FROM MID_devdb a
    JOIN MID_devdb b
    ON a.job_type = b.job_type
    AND a.bbl <> b.bbl
    AND a.address = b.address
    AND a.classa_net = b.classa_net
    AND a.job_inactive IS NULL 
    AND b.job_inactive IS NULL),
DUPLICATES_diff_units AS (
    SELECT 
        a.job_number, 
        b.job_number as duplicate_job_number,
        'duplicate with dif units_net' as reason
    FROM MID_devdb a
    JOIN MID_devdb b
    ON a.job_type = b.job_type
    AND a.bbl <> b.bbl
    AND a.address = b.address
    AND a.classa_net <> b.classa_net
    AND a.job_inactive IS NULL 
    AND b.job_inactive IS NULL)
SELECT * FROM DUPLICATES_equal_units
UNION SELECT * FROM DUPLICATES_diff_units;

Based on HED criteria, implement logic to identify and remove duplicate job records

Current logic: Remove records where the job number is the same https://github.com/NYCPlanning/db-developments/blob/master/developments_build/sql/dedupe_job_number.sql

Identify potential duplicates https://github.com/NYCPlanning/db-developments/blob/master/developments_build/sql/x_duplicate.sql

kschmidtDCP commented 4 years ago

Current logic to remove records where the job number is the same should remain unchanged: https://github.com/NYCPlanning/db-developments/blob/master/developments_build/sql/dedupe_job_number.sql

To our knowledge, possible duplicates with different job numbers are not removed - instead they are flagged as inactive through manual research. Please confirm. We would like to use this logic instead for flagging these jobs, rather than what's in x_duplicate.sql.

kschmidtDCP commented 4 years ago

In the new deduplication logic (#31), we filter to jobs that are not inactive. Do both of the potential duplicate jobs need to have x_inactive IS NULL to be output to the QAQC table, or only one job?

@mgraber Both jobs need to be active. (Slightly different from one is inactive - sorry for lack of clarity, I'll edit above.)

kschmidtDCP commented 4 years ago

100: Similarly, we remove duplicates programatically. Should we not automatically remove them? If we keep the removal step, are the duplicate QAQC checks to be done before or after removal? #31 #106

@mgraber @AmandaDoyle There are two types of duplicates, which we handle differently.

  1. Duplicate job numbers are handled programmatically here: https://github.com/NYCPlanning/db-developments/blob/master/developments_build/sql/dedupe_job_number.sql
  2. Suspected duplicates with different job numbers are handled in the QAQC tables. #106