NYCPlanning / db-knownprojects

KPDB: A compilation of prospective residential development projects from various sources, with rough projections of new unit counts
https://nycplanning.github.io/db-knownprojects
0 stars 0 forks source link

DOB Clustering Failure #360

Closed td928 closed 1 year ago

td928 commented 2 years ago

Initially thought this might be one off case where the DOB clustering failed for certain projects. But upon further review reveals that a systemic issue exists and likely affecting all DOB clustering.

One example of this is the ESD project for Atlantic Yard. The DOB records (record ids: 321590532, 321595403) associated with Atlantic Yard recognizes itself as clustered with the ESD project, as indicated by their project_record_ids field in review_dob. On the other hand, the ESD project does NOT recognize itself as in cluster with those records and therefore units did not get duplicated in the final KPDB.

I believe this is a DOB clusters step issue Further exam review_dob file indicates that the issues might be that the dob record ids are simply not included in the project_record_ids. When the matches table is created in the review_dob.sql the dob records ids are not appended to the other project_record_ids coming from the last clustering step.

https://github.com/NYCPlanning/db-knownprojects/blob/f0a14b41b4a9fa4461b2417f69766213b1b1af4f/sql/review_dob.sql#L48-L83

Solution: Append the DOB record ids to the project_record_ids in the matches table creation step

td928 commented 2 years ago

https://github.com/NYCPlanning/db-knownprojects/blob/dob-dedup-issue-360

Using this branch to work on a fix now.

Progress Update: The DOB record ids needs to be added to the project_record_ids in order for the deduplication to be successful. Current implementation of adding it review_dob does NOT address the bug.

td928 commented 2 years ago

Upon further investigation, the verified_matches table show the correct matching results for the DOB records

https://github.com/NYCPlanning/db-knownprojects/blob/440123187e2569635fd17cde4970675e191b188f/sql/project_record_ids.sql#L57-L68

On the other hand, the project_record_ids table does not show the correct joined ids. So the bug is in the following lines to create the final project_record_ids table

https://github.com/NYCPlanning/db-knownprojects/blob/440123187e2569635fd17cde4970675e191b188f/sql/project_record_ids.sql#L70-L86

At least one problematic edge case where multiple DOB records are matched in the cluster, then the expected correct behavior should be appending all DOB records into the cluster

td928 commented 2 years ago

finally located the issue in _project_record_ids table in its initial creation. For records only with DOB cluster but not others were not included at all in the _project_record_ids list. Thereof later when appending the dob records to the list by record_id_match, the dob records cannot locate matched records within the array since they do not exist in the first place.

SOLUTION: Move the INSERT INTO project_record_ids stand-alone project ahead of the update clause should address the issue.

https://github.com/NYCPlanning/db-knownprojects/blob/f0a14b41b4a9fa4461b2417f69766213b1b1af4f/sql/project_record_ids.sql#L73-L79

0220f9d does address this issue partially. However, the multiple match in the update statement remain. As it currently constructed, only one DOB record (chosen by SQL server according to this stack post) will be added to the cluster array even if more than one is matched,

UPDATE: 52b57ef fully addressed the issue where only one random DOB record will be successfully clustered by creating a array of DOB records first before joining to the non-DOB record. Now all the spatially joined DOB should be clustered.

td928 commented 2 years ago

02ed221 addresses the duplicate DOB records issues. The final results look good with clustering and deduplication. Will open pull request for this issue and close issue for now.