Closed: SPTKL closed this issue 1 year ago
Potential places where duplicates are introduced:
@mgraber @SPTKL I may have found the issue with the DOB phasing assumption. Some of the DOB records have the inactive flag set to '0' instead of null, which might cause their phasing assumption not to be assigned.
@td928 The inactive flag for DOB records gets changed to 0 or 1 here: https://github.com/NYCPlanning/db-knownprojects/blob/d37cac22bd175b2813f51c5e57e32557cf39587d/sql/dcp_housing.sql#L107 The source data (DevDB) uses text values rather than boolean. See #354.
Thank you @mgraber ! This is really helpful context. I am a little confused, then, about why there are any null values in job_inactive. It seems like they should all be 0 or 1?
@td928 All DOB records? Or all records regardless of source?
It seems like there are a handful of records in deduped_units that have multiple project_ids, which led to the duplicates.
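For anyone reproducing this, a quick diagnostic along these lines should surface those records. This is only a sketch; the column names (`source`, `record_id`, `project_id`) are assumptions based on this thread, not the actual schema.

```sql
-- Hypothetical check: find source records in deduped_units that are
-- attached to more than one project_id. Column names are assumptions.
SELECT source, record_id, COUNT(DISTINCT project_id) AS n_projects
FROM deduped_units
GROUP BY source, record_id
HAVING COUNT(DISTINCT project_id) > 1
ORDER BY n_projects DESC;
```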
sorry I meant the DOB records.
@td928 are you talking about the input data or the DOB records as they show up in KPDB? The input data has `Inactive: Withdrawn`, `Inactive: Stalled`, `Inactive: Duplicate`, and `NULL`, while KPDB simplifies this to a boolean 0/1.
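A minimal sketch of that simplification, assuming the DevDB text values listed above and a `job_inactive` column (the actual logic lives in the `sql/dcp_housing.sql` file linked earlier; the table and column names here are assumptions):

```sql
-- Hypothetical sketch of mapping DevDB's text inactive statuses to a 0/1 flag.
-- Table and column names are assumptions; see sql/dcp_housing.sql for the real logic.
SELECT
    job_number,
    CASE
        WHEN job_inactive IS NULL THEN 0            -- active record
        WHEN job_inactive LIKE 'Inactive%' THEN 1   -- Withdrawn / Stalled / Duplicate
        ELSE 0
    END AS job_inactive_flag
FROM dcp_housing;
```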
Sorry, I think I was just confused about the difference between inactive in KPDB and inactive in DevDB. I think the issue is resolved, and I closed my pull request to change the condition. @mgraber
The Gowanus record appears in the final combined table twice. I think there is an early duplication introduced here when the source file is read in, although I can't be sure because I don't know what is in dcp_rezoning.
Then, as @SPTKL mentioned earlier, the dedup process further multiplied it here.
@td928 thanks. I will continue to look into this.
@td928 The two DOB records show up as duplicates because they spatially match with multiple projects (in the dob_review table, they are flagged as multi-matches that need to get resolved), but are not assigned to a single project in the dob_corrections table.
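A check along these lines would surface records in that state; the table names come from this thread, but the column names (`record_id`, `multi_match`) are assumptions about the schema:

```sql
-- Hypothetical check: multi-matched DOB records in the review table that
-- have no resolving entry in the corrections table. Column names are assumptions.
SELECT r.record_id
FROM dob_review r
LEFT JOIN dob_corrections c
  ON c.record_id = r.record_id
WHERE r.multi_match = true
  AND c.record_id IS NULL;
```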
@td928 When we read in the dcp_n_study_future data and join with dcp_rezoning geometries, we get duplicates because there are two geometries associated with the neighborhood Gowanus. One is the rezoning area and one is the context area. Which should we be joining with the Gowanus record in the future rezonings data? If there is a consistent logic, we can add it to the join. Otherwise, we should remove the geometry we don't want joined from the input data.
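If there is a consistent rule, the join-side fix might look roughly like the sketch below. The table names come from this thread, but the join key and the `area_type` column used to exclude context areas are assumptions about the schema:

```sql
-- Hypothetical sketch: join future-rezonings records to rezoning geometries
-- while excluding context-area polygons, so a neighborhood like Gowanus
-- matches only one geometry. Join key and area_type column are assumptions.
SELECT f.*, r.geom
FROM dcp_n_study_future f
JOIN dcp_rezoning r
  ON r.neighborhood = f.neighborhood
 AND r.area_type <> 'context';
```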
@mgraber I will look into the DOB records and make the appropriate corrections for them in the cluster review table.
About the rezoning, I think using the rezoning area in this case should be fine, and we can exclude the one for the context area.
Thank you!
@td928 Is excluding context areas something you'd like us to do across the whole dataset? If you look in the input shapefile, there are 10 context areas. If it is just for Gowanus, could you update the input data so we're not hard-coding an exception?
@mgraber we looked into the file and removed a few context areas that are probably irrelevant to our purposes, including the Gowanus context area. I created this pull request to update the nyc_rezonings source file.
Stale issue message