Hi, thanks for pointing this out. I think the issue comes from the downcasting of `measurement_datetime` when the tables are saved to parquet.

There is no duplicated data when using `measurement_date` and `measurement_time` as the time component:
```python
import pandas as pd

measurement_0 = pd.read_parquet(r'D:/BLENDED_ICU/blended_data/OMOP-CDM/measurement/MEASUREMENT_0.parquet')
measurement_1 = pd.read_parquet(r'D:/BLENDED_ICU/blended_data/OMOP-CDM/measurement/MEASUREMENT_1.parquet')
df = pd.concat([measurement_0, measurement_1], axis=0)

primarykey = df[['measurement_date',
                 'measurement_time',
                 'visit_occurrence_id',
                 'measurement_concept_id']]
dupli = primarykey.duplicated()
dupli.sum()
# >> 0
```
But there are duplicated rows when using `measurement_datetime`:
```python
primarykey = df[['measurement_datetime',
                 'visit_occurrence_id',
                 'measurement_concept_id']]
dupli = primarykey.duplicated()
dupli.sum()
# >> 73168825
```
In fact, once saved to parquet, the `measurement_datetime` column is equal to `measurement_date`: the time-of-day component is lost, so every measurement of the same concept taken on the same day of the same visit collides on the same key. I'm fixing this very soon.
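As a quick check (a sketch, assuming the concatenated `df` from the snippets above), you can confirm that the stored datetimes carry no time-of-day information:

```python
# Sketch: confirm that the stored datetime equals the plain date,
# i.e. the time-of-day was dropped on the parquet round-trip.
# Assumes `df` is the concatenated frame from the snippets above.
same_as_date = (
    pd.to_datetime(df['measurement_datetime'])
    == pd.to_datetime(df['measurement_date'])
)
print(same_as_date.all())  # True if the time component was lost
```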
The data itself seems to be fine; a quick fix would be to omit `measurement_datetime` and instead use `measurement_date` and `measurement_time` as the time component, as sketched below.
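A minimal sketch of that workaround (it assumes `measurement_time` comes back from parquet as `'HH:MM:SS'` strings or `datetime.time` objects; the `measurement_datetime_fixed` column name is just for illustration):

```python
# Workaround sketch: rebuild a full timestamp from the date and time columns.
# Assumes measurement_time round-trips as 'HH:MM:SS' strings or datetime.time
# objects; adjust the casts if your dtypes differ.
df['measurement_datetime_fixed'] = pd.to_datetime(
    df['measurement_date'].astype(str) + ' ' + df['measurement_time'].astype(str)
)

primarykey = df[['measurement_datetime_fixed',
                 'visit_occurrence_id',
                 'measurement_concept_id']]
print(primarykey.duplicated().sum())  # expected 0, matching the date+time check
```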
Please tell me if this fixes the duplications for you.
This should be fixed in v0.3.1.
There are duplications in the measurement and drug_exposure tables in OMOP: we have around 300 million duplicated rows in the measurement table and about 3 million duplicated rows in the drug_exposure table.