duplications in measurement and drug_exposure

mostafaalishahi commented 8 months ago

There are duplicated rows in the measurement and drug_exposure tables in OMOP: around 300 million in the measurement table and about 3 million in the drug_exposure table.

USM-CHU-FGuyon commented 8 months ago

Hi, thanks for pointing this out. I think the issue comes from the downcasting of measurement_datetime.

There is no duplicated data when using measurement_date and measurement_time as the time component:

import pandas as pd

# Load two of the measurement parquet files and stack them into one frame.
measurement_0 = pd.read_parquet(r'D:/BLENDED_ICU/blended_data/OMOP-CDM/measurement/MEASUREMENT_0.parquet')
measurement_1 = pd.read_parquet(r'D:/BLENDED_ICU/blended_data/OMOP-CDM/measurement/MEASUREMENT_1.parquet')

df = pd.concat([measurement_0, measurement_1], axis=0)

# Candidate primary key built from the separate date and time columns.
primarykey = df[['measurement_date',
                 'measurement_time',
                 'visit_occurrence_id',
                 'measurement_concept_id']]

# Count the rows whose key repeats an earlier row.
dupli = primarykey.duplicated()
dupli.sum()
>> 0

But there are duplicates when using measurement_datetime as the time component:

primarykey = df[['measurement_datetime',
                 'visit_occurrence_id',
                 'measurement_concept_id']]

dupli = primarykey.duplicated()
dupli.sum()
>> 73168825

In fact, once saved to parquet, the measurement_datetime column is equal to measurement_date, so the time of day is lost. I'm fixing this very soon.
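To illustrate why losing the time of day produces duplicates, here is a minimal sketch on toy data. The values (including the concept id 1001) are made up for the example and are not taken from the BlendedICU files, and `dt.normalize()` is only used here to simulate the truncation, not to reproduce the actual export code.

```python
import pandas as pd

# Toy data (hypothetical values): the same concept measured three times
# for the same visit, twice on the same day at different hours.
toy = pd.DataFrame({
    "visit_occurrence_id": [1, 1, 1],
    "measurement_concept_id": [1001, 1001, 1001],
    "measurement_datetime": pd.to_datetime([
        "2020-01-01 08:00:00",
        "2020-01-01 09:00:00",
        "2020-01-02 08:00:00",
    ]),
})

key = ["measurement_datetime", "visit_occurrence_id", "measurement_concept_id"]

# With the full timestamp, every key is unique.
n_dupes_full = int(toy.duplicated(subset=key).sum())

# Simulate the loss of the time part: set every timestamp to midnight.
truncated = toy.assign(
    measurement_datetime=toy["measurement_datetime"].dt.normalize()
)
n_dupes_truncated = int(truncated.duplicated(subset=key).sum())

print(n_dupes_full, n_dupes_truncated)
```

The two measurements taken on 2020-01-01 end up with the same key once the time is dropped, so they are reported as duplicates even though the underlying rows differ.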

The data seems to be fine. A quick fix would be to omit measurement_datetime and use measurement_date and measurement_time as the time component.
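For anyone who needs a single timestamp column in the meantime, the quick fix above could be sketched as follows. This is an assumption-laden example: `rebuild_datetime` is a hypothetical helper, and it assumes that measurement_date holds date-like values and that measurement_time holds `HH:MM:SS` strings (or values whose string form looks like that). Check the actual dtypes in your parquet files before relying on it.

```python
import pandas as pd

def rebuild_datetime(df, date_col="measurement_date", time_col="measurement_time"):
    """Hypothetical helper: combine separate date and time columns
    into one datetime Series (assumes 'HH:MM:SS'-like time values)."""
    date_part = pd.to_datetime(df[date_col]).dt.strftime("%Y-%m-%d")
    time_part = df[time_col].astype(str)
    return pd.to_datetime(date_part + " " + time_part)

# Toy data (made-up values) with the same date but different times.
toy = pd.DataFrame({
    "measurement_date": ["2020-01-01", "2020-01-01"],
    "measurement_time": ["08:00:00", "09:00:00"],
    "visit_occurrence_id": [1, 1],
    "measurement_concept_id": [1001, 1001],
})

# Overwrite the truncated column with the rebuilt timestamp.
toy["measurement_datetime"] = rebuild_datetime(toy)

key = ["measurement_datetime", "visit_occurrence_id", "measurement_concept_id"]
n_dupes = int(toy.duplicated(subset=key).sum())
print(n_dupes)
```

On the real tables the same call would be applied to `df` after loading the parquet files, and the duplicate check would then use the rebuilt measurement_datetime column.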

Please tell me if this fixes the duplications for you.

USM-CHU-FGuyon commented 8 months ago

This should be fixed in v0.3.1.

USM-CHU-FGuyon / BlendedICU

duplications in measurement and drug_exposure #26