Closed ChristopheLambert closed 3 years ago
@ChristopheLambert we see it on our side too . . . they do look like duplicates to me and @clairblacketer . Thank you for logging this I'll let my team know.
Hahaha . . . . sorry we try to make CDM_BUILDER friendly to others outside of JNJ but only so much we can do. :(
Thanks, Erica. Here is some more weirdness:
After running the ETL on Truven MDCR, there are over 409M observation records where the observation_concept_id=900000010 and the observation_type_concept_id=900000009, neither of which are legitimate codes in the concept vocabulary. All of them have the observation_source_value of MHSACOVG, see some of them below:
observation_id | person_id | observation_concept_id | observation_date | observation_time | observation_type_concept_id | value_as_number | value_as_string | value_as_concept_id | qualifier_concept_id | unit_concept_id | provider_id | visit_occurrence_id | observation_source_value | observation_source_concept_id | unit_source_value | qualifier_source_value
----------------+-----------+------------------------+------------------+------------------+-----------------------------+-----------------+-----------------+---------------------+----------------------+-----------------+-------------+---------------------+--------------------------+-------------------------------+-------------------+------------------------
36190 | 144104 | 900000010 | 2006-01-31 | | 900000009 | 1.0 | | 0 | 0 | 0 | | | MHSACOVG | 0 | |
36191 | 144104 | 900000010 | 2006-02-28 | | 900000009 | 1.0 | | 0 | 0 | 0 | | | MHSACOVG | 0 | |
36192 | 144104 | 900000010 | 2006-03-31 | | 900000009 | 1.0 | | 0 | 0 | 0 | | | MHSACOVG | 0 | |
Also in the visit_occurrence records, there are provider_id values set to zero -- missing values should be set to NULL. Similarly, care_site should not have 0 values for location_id -- just NULL. These cause problems with creating primary keys and constraints.
Yeah, the document talks about it may not be 100% clear.
https://github.com/OHDSI/ETL-CDMBuilder/blob/master/man/TRUVEN_CCAE_MDCR/Truven_CCAE_and_MDCR_ETL_CDM_V5_V14.01.doc
_Additionally we add a Mental Health and Substance Abuse flags to the OBSERVATION table. In Truven in the ENROLLMENT_DETAIL table there is information about when an individual will have this type of coverage. We will add a 1 to VALUE_ASNUMBER when the coverage exists and else a 0. Make sure to take distinct records per ENROLID, DTEND, MHSACOVG.
Sometimes I beg Christian for new CONCEPT_IDs however I don't think I chased him down on this one. I'm not sure if anyone else would need this but if you think they would I can ask him for the CONCEPT_IDs and make this more formal.
INSERT INTO CONCEPT (CONCEPT_ID, CONCEPT_NAME, DOMAIN_ID, VOCABULARY_ID, CONCEPT_CLASS_ID, STANDARD_CONCEPT, CONCEPT_CODE, VALID_START_DATE, VALID_END_DATE, INVALID_REASON)
VALUES(900000009,'Mental Health Substance Abuse Coverage Indicator','Type Concept','Observation Type','Observation Type','S','OMOP generated','01/06/2013','12/31/2099','')
INSERT INTO CONCEPT (CONCEPT_ID, CONCEPT_NAME, DOMAIN_ID, VOCABULARY_ID, CONCEPT_CLASS_ID, STANDARD_CONCEPT, CONCEPT_CODE, VALID_START_DATE, VALID_END_DATE, INVALID_REASON)
VALUES(900000010,'Mental Health Substance Abuse Coverage Indicator','Observation','OMOP generated','Coverage Indicator','S','MHSACOVG','01/06/2013','12/31/2099','')
Thank you for column checking. I'll let my guys know.
Thanks, Erica.
Truven said they started the MHSACOVG indicator in 2004, but that they retroactively added it to 2002 and 2003 data. Should the VALID_START_DATE for these concepts thus be at least 01/01/2004, if not 01/01/2002, instead of 01/06/2013?
Do you think it is safe to just drop the duplicates in visit_cost?
Kind regards, Christophe
Technically that might be more correct . . . but I just give any concept I add my son's birthday. :) Unless a Vocab is date sensitive (e.g. DRG, NDC), it shouldn't matter.
Yes, it is safe to drop the duplicates in VISIT_COST.
The ETL appears to produce multiple duplicate records in the visit_cost table, violating the constraint that visit_cost_id is a unique primary key.
We first detected this after running the ETL under Windows, exporting the tables to text, importing them into a Linux PostgreSQL CDMv5 schema , then running the following constraint:
ALTER TABLE mdcr.visit_cost ADD CONSTRAINT xpk_visit_cost PRIMARY KEY ( visit_cost_id ) ;
We get the error:
Here is the duplicate data in visit_cost:
Note, some visit_cost records are duplicated as many as 25 times.
The ETL was interrupted under Windows after transforming the data but before creating indices and so forth that we don't need. Is there a step in the ETL that drops duplicate entries in visit_cost? All of the other tables had no problems with creating constraints.
Should it be safe to just drop all but one copy of the duplicate entries?
What we would give for a direct Linux PostgreSQL implementation of this ETL!
Thanks! Christophe