EHDEN / ETL-UK-Biobank

ETL UK-Biobank
https://ehden.github.io/ETL-UK-Biobank/
13 stars 4 forks source link

Primary key visit occurrence failing (2021-03-16) #263

Closed MaximMoinat closed 3 years ago

MaximMoinat commented 3 years ago
2021-03-16 23:56:10,642 - INFO - Unable to add pk_visit_occurrence
MaximMoinat commented 3 years ago

From DQD, isPrimaryKey, 76,108 unique visit_occurrence_ids are duplicated (1.51% of all visit records).

To understand where the duplicates are coming from, we can execute the following query:

SELECT cdmTable.data_source,
       count(*),
       sum(CASE WHEN cdmTable.VISIT_OCCURRENCE_ID IN (
           SELECT VISIT_OCCURRENCE_ID
           FROM omopcdm.VISIT_OCCURRENCE
           GROUP BY VISIT_OCCURRENCE_ID
           HAVING COUNT(*) > 1
       ) THEN 1 ELSE 0 END) AS n_pk_duplicates
FROM omopcdm.VISIT_OCCURRENCE cdmTable
GROUP BY cdmTable.data_source;
MaximMoinat commented 3 years ago

Examination of real data shows that PR #274 indeed solves this issue.

All pk clashes are in gp_covid.

data_source count n_pk_duplicates
covid19 gp_emis 2452889 38054
covid19 gp_tpp 1299931 38054