Open td928 opened 1 year ago
okay figured out what the issue is.
the procedure to do the combine
action in manual project is the following
CREATE OR REPLACE PROCEDURE apply_combine(
record_id text,
record_id_match text
) AS $$
DECLARE
new_record_ids text[];
BEGIN
SELECT
array_agg(rid) AS record_ids
INTO new_record_ids
FROM (
SELECT 1 as col, unnest(b.project_record_ids) as rid
FROM _project_record_ids b
WHERE record_id = any(b.project_record_ids)
OR record_id_match = any(b.project_record_ids)
) a GROUP BY col;
DELETE FROM _project_record_ids
WHERE record_id_match=any(project_record_ids);
UPDATE _project_record_ids
SET project_record_ids = new_record_ids
WHERE record_id = any(project_record_ids);
END
$$ LANGUAGE plpgsql;
In plain language, what it does is that it looks for the cluster that the record_id
is part of and also cluster that record_id_match
is part of and then merge those clusters together into one by including all the records.
This causes an unexpected behavior in our edge case where the record_id
cluster no longer exits and record_id_match
cluster is the same before and after the merging step. Then the DELETE command would then remove the entire cluster before trying to add the new cluster new_record_ids
back in but since record_id
is outdated so it cannot do this and we lost the entire cluster.
Okay I think this makes sense to me, I feel like I need to read this a few times over and then maybe we can chat about it this example
Stale issue message
Stale issue message
We came across a cluster group (see screenshot attached) which the dbscan and our logics correctly identified as clusters. But after the manual corrections step at
psql $BUILD_ENGINE -1 -f sql/correct_projects.sql
the cluster is broken up despite the only operations manual corrections targeting those specific records are joining them together and speficially
2021X0133,reassign,P2017X0037
29abcba6f9c4db82627ac82f70d19bed,combine,2021X0133
One hypothesis could be the second combine corrections the uid is outdated. But the expected behavior probably should still not be breaking up existing cluster just because one of the correction is not applicable.