NYCPlanning / db-knownprojects

KPDB: A compilation of prospective residential development projects from various sources, with rough projections of new unit counts
https://nycplanning.github.io/db-knownprojects
0 stars 0 forks source link

manual corrections causing clusters to be broken up #376

Open td928 opened 1 year ago

td928 commented 1 year ago

We came across a cluster group (see screenshot attached) which the dbscan and our logics correctly identified as clusters. But after the manual corrections step at

psql $BUILD_ENGINE -1 -f sql/correct_projects.sql

the cluster is broken up despite the only operations manual corrections targeting those specific records are joining them together and speficially

2021X0133,reassign,P2017X0037

29abcba6f9c4db82627ac82f70d19bed,combine,2021X0133

One hypothesis could be the second combine corrections the uid is outdated. But the expected behavior probably should still not be breaking up existing cluster just because one of the correction is not applicable.

Screenshot (7)

td928 commented 1 year ago

okay figured out what the issue is.

the procedure to do the combine action in manual project is the following


CREATE OR REPLACE PROCEDURE apply_combine(
    record_id text, 
    record_id_match text
) AS $$
DECLARE
    new_record_ids text[];
BEGIN

    SELECT 
        array_agg(rid) AS record_ids
    INTO new_record_ids
    FROM (
        SELECT 1 as col, unnest(b.project_record_ids) as rid
        FROM _project_record_ids b 
        WHERE record_id = any(b.project_record_ids)
        OR record_id_match = any(b.project_record_ids)
    ) a GROUP BY col;

    DELETE FROM _project_record_ids
    WHERE record_id_match=any(project_record_ids);

    UPDATE _project_record_ids
    SET project_record_ids = new_record_ids
    WHERE record_id = any(project_record_ids);
END
$$ LANGUAGE plpgsql;

In plain language, what it does is that it looks for the cluster that the record_id is part of and also cluster that record_id_match is part of and then merge those clusters together into one by including all the records.

This causes an unexpected behavior in our edge case where the record_id cluster no longer exits and record_id_match cluster is the same before and after the merging step. Then the DELETE command would then remove the entire cluster before trying to add the new cluster new_record_ids back in but since record_id is outdated so it cannot do this and we lost the entire cluster.

mbh329 commented 1 year ago

Okay I think this makes sense to me, I feel like I need to read this a few times over and then maybe we can chat about it this example

github-actions[bot] commented 1 year ago

Stale issue message

github-actions[bot] commented 1 year ago

Stale issue message