EHDEN / ETL-UK-Biobank

ETL UK-Biobank
https://ehden.github.io/ETL-UK-Biobank/

High frequency of "MCHC - Mean corpuscular haemoglobin concentration" #292

Closed MaximMoinat closed 3 years ago

MaximMoinat commented 3 years ago

In the measurement table, the concept "MCHC - Mean corpuscular haemoglobin concentration" (37393850) occurs 1.5M times, more than ten times as often as the second most frequent concept (Plasma HDL level).

We should investigate which source codes and fields map to this concept_id.

MaximMoinat commented 3 years ago

Counting all the source codes for target concept 37393850 gives:

| measurement_source_value | measurement_source_concept_id | data_source | count |
| -- | -- | -- | -- |
| 1022481000000109 | 37393850 | covid19 gp_emis | 1548925 |
| 429.. | 45488665 | GP-1 | 3279 |
| 429.. | 45488665 | GP-2 | 2919 |
| 429.. | 45488665 | GP-3 | 16955 |
| 429.. | 45488665 | GP-4 | 901 |
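For reference, a count like this can be sketched as a grouped query on the measurement table. This is only a minimal illustration, not the query that was actually run: it uses an in-memory SQLite table with made-up rows, the helper name `count_source_codes` is hypothetical, and the `data_source` column is left out because the thread does not say how it is derived.

```python
import sqlite3


def count_source_codes(conn, target_concept_id):
    """Count measurement rows per source value and source concept
    for a single target concept (hypothetical helper)."""
    query = """
        SELECT measurement_source_value,
               measurement_source_concept_id,
               COUNT(*) AS n
        FROM measurement
        WHERE measurement_concept_id = ?
        GROUP BY measurement_source_value, measurement_source_concept_id
        ORDER BY n DESC
    """
    return conn.execute(query, (target_concept_id,)).fetchall()


# Toy in-memory table with made-up rows, only to show the query shape.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE measurement (
           measurement_concept_id INTEGER,
           measurement_source_value TEXT,
           measurement_source_concept_id INTEGER)"""
)
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [
        (37393850, "1022481000000109", 37393850),
        (37393850, "1022481000000109", 37393850),
        (37393850, "429..", 45488665),
        (3000000, "other", 0),  # hypothetical unrelated concept
    ],
)

rows = count_source_codes(conn, 37393850)
for source_value, source_concept_id, n in rows:
    print(source_value, source_concept_id, n)
```

Against a real OMOP CDM database the same `SELECT` would run on the actual measurement table; only the connection and parameter placeholder style would change with the database driver.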

The code 1022481000000109 from covid19 gp_emis is the SNOMED code for "MCHC - Mean corpuscular haemoglobin concentration". The ETL maps 1.5M records with this code to the measurement table. It remains to be investigated whether this is correct, or whether it is a source data issue or an ETL issue.

MaximMoinat commented 3 years ago

Checking the covid19 gp_emis source confirmed that almost half of the records do indeed have this code for MCHC.

The file has 3,304,808 lines in total, of which 1,553,334 contain the code '1022481000000109'.
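A check like this can be sketched as a simple line scan. This is a minimal sketch under assumptions, not the command that was actually used: it matches the code as a plain substring anywhere in the line, counts a header line (if present) towards the total, and the helper name `count_lines_with_code` and the sample data are made up for illustration.

```python
import io


def count_lines_with_code(lines, code):
    """Return (total lines, lines containing `code` as a substring)."""
    total = 0
    matching = 0
    for line in lines:
        total += 1
        if code in line:
            matching += 1
    return total, matching


# Made-up sample standing in for the source file (assumed tab-separated).
sample = io.StringIO(
    "eid\tcode\n"
    "1\t1022481000000109\n"
    "2\t999\n"
    "3\t1022481000000109\n"
)
total, matching = count_lines_with_code(sample, "1022481000000109")
print(total, matching)

# Share of matching lines using the figures reported in this thread.
share = 1_553_334 / 3_304_808
print(f"{share:.1%}")
```

For a real file, passing an open file handle (`with open(path) as f: ...`) instead of the `StringIO` sample streams the file line by line, so it does not need to fit in memory. With the reported figures the share comes out at roughly 47%, which matches "almost half".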

So the OMOP mapping is correct; it reflects the source data. The ETL even filters out some of the records: 1,553,334 source lines contain the code, while 1,548,925 records end up in the measurement table.

This high frequency might be a source data quality issue.