Closed MaximMoinat closed 3 years ago
From DQD, isPrimaryKey, 76,108 unique visit_occurrence_ids are duplicated (1.51% of all visit records).
To understand where the duplicates are coming from, we can execute the following query:
SELECT cdmTable.data_source,
count(*),
sum(CASE WHEN cdmTable.VISIT_OCCURRENCE_ID IN (
SELECT VISIT_OCCURRENCE_ID
FROM omopcdm.VISIT_OCCURRENCE
GROUP BY VISIT_OCCURRENCE_ID
HAVING COUNT(*) > 1
) THEN 1 ELSE 0 END) AS n_pk_duplicates
FROM omopcdm.VISIT_OCCURRENCE cdmTable
GROUP BY cdmTable.data_source;
Examination of real data shows that PR #274 indeed solves this issue.
All pk clashes are in gp_covid.
data_source | count | n_pk_duplicates |
---|---|---|
covid19 gp_emis | 2452889 | 38054 |
covid19 gp_tpp | 1299931 | 38054 |