Closed: SPTKL closed this issue 1 year ago
Potential places where duplicates are introduced:
@mgraber @SPTKL I may have found the issue with the DOB phasing assumption. Some of the DOB records have the inactive flag set to '0' instead of null, which might cause their phasing assumption not to be assigned.
@td928 The inactive flag for DOB records gets changed to 0 or 1 here: https://github.com/NYCPlanning/db-knownprojects/blob/d37cac22bd175b2813f51c5e57e32557cf39587d/sql/dcp_housing.sql#L107 The source data (DevDB) uses text values rather than boolean. See #354.
Thank you @mgraber ! This is really helpful context. I am a little confused, then, about why there are any null values in job_inactive. It seems like they should all be 0 or 1?
@td928 All DOB records? Or all records regardless of source?
It seems like there are a handful of records in deduped_units that have multiple project_ids, which led to the duplicates.
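For anyone reproducing this, a quick diagnostic along these lines should surface those records. This is only a sketch; the column names (`source`, `record_id`, `project_id`) are assumptions based on this thread, not the actual schema.

```sql
-- Hypothetical check: find source records in deduped_units that are
-- attached to more than one project_id. Column names are assumptions.
SELECT source, record_id, COUNT(DISTINCT project_id) AS n_projects
FROM deduped_units
GROUP BY source, record_id
HAVING COUNT(DISTINCT project_id) > 1
ORDER BY n_projects DESC;
```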
sorry I meant the DOB records.
@td928 are you talking about the input data or the DOB records as they show up in KPDB? The input data has `Inactive: Withdrawn`, `Inactive: Stalled`, `Inactive: Duplicate`, and `NULL`, while KPDB simplifies this to a boolean 0/1.
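A minimal sketch of that simplification, assuming the DevDB text values listed above and a `job_inactive` column (the actual logic lives in the `sql/dcp_housing.sql` file linked earlier; the table and column names here are assumptions):

```sql
-- Hypothetical sketch of mapping DevDB's text inactive statuses to a 0/1 flag.
-- Table and column names are assumptions; see sql/dcp_housing.sql for the real logic.
SELECT
    job_number,
    CASE
        WHEN job_inactive IS NULL THEN 0            -- active record
        WHEN job_inactive LIKE 'Inactive%' THEN 1   -- Withdrawn / Stalled / Duplicate
        ELSE 0
    END AS job_inactive_flag
FROM dcp_housing;
```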
Sorry, I think I was just confused about the difference between inactive in KPDB and inactive in DevDB. I think the issue is resolved, and I closed my pull request to change the condition. @mgraber
The Gowanus record appears in the final combined table twice. I think there is an early duplication introduced here when the source file is read in, although I can't be sure because I don't know what is in dcp_rezoning.
Then, as @SPTKL mentioned earlier, the dedup process further multiplied it here.
@td928 thanks. I will continue to look into this.
@td928 The two DOB records show up as duplicates because they spatially match with multiple projects (in the dob_review table, they are flagged as multi-matches that need to get resolved), but are not assigned to a single project in the dob_corrections table.
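A check along these lines would surface records in that state; the table names come from this thread, but the column names (`record_id`, `multi_match`) are assumptions about the schema:

```sql
-- Hypothetical check: multi-matched DOB records in the review table that
-- have no resolving entry in the corrections table. Column names are assumptions.
SELECT r.record_id
FROM dob_review r
LEFT JOIN dob_corrections c
  ON c.record_id = r.record_id
WHERE r.multi_match = true
  AND c.record_id IS NULL;
```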
@td928 When we read in the dcp_n_study_future data and join with dcp_rezoning geometries, we get duplicates because there are two geometries associated with the neighborhood Gowanus. One is the rezoning area and one is the context area. Which should we be joining with the Gowanus record in the future rezonings data? If there is a consistent logic, we can add it to the join. Otherwise, we should remove the geometry we don't want joined from the input data.
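If there is a consistent rule, the join-side fix might look roughly like the sketch below. The table names come from this thread, but the join key and the `area_type` column used to exclude context areas are assumptions about the schema:

```sql
-- Hypothetical sketch: join future-rezonings records to rezoning geometries
-- while excluding context-area polygons, so a neighborhood like Gowanus
-- matches only one geometry. Join key and area_type column are assumptions.
SELECT f.*, r.geom
FROM dcp_n_study_future f
JOIN dcp_rezoning r
  ON r.neighborhood = f.neighborhood
 AND r.area_type <> 'context';
```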
@mgraber I will look into the DOB records and make the appropriate corrections for them in the cluster review table.
About the rezoning, I think using the rezoning area in this case should be fine, and we can exclude the one for the context area.
Thank you!
@td928 Is excluding context areas something you'd like us to do across the whole dataset? If you look in the input shapefile, there are 10 context areas. If it is just for Gowanus, could you update the input data so we're not hard-coding an exception?
@mgraber we looked into the file and removed a few context areas that are probably irrelevant to our purposes, including the Gowanus context area. I created this pull request to update the nyc_rezonings source file.
Stale issue message