OHDSI / ETL-CDMBuilder

ETL-CDMBuilder is a repo containing a .NET Core application to perform ETL to OMOP CDM for multiple databases
Apache License 2.0
50 stars 32 forks source link

Duplicate visit_cost_id in visit_cost table violating primary key constraint #22

Closed ChristopheLambert closed 3 years ago

ChristopheLambert commented 7 years ago

The ETL appears to produce multiple duplicate records in the visit_cost table, violating the constraint that visit_cost_id is a unique primary key.

We first detected this after running the ETL under Windows, exporting the tables to text, importing them into a Linux PostgreSQL CDMv5 schema , then running the following constraint:

ALTER TABLE mdcr.visit_cost ADD CONSTRAINT xpk_visit_cost PRIMARY KEY ( visit_cost_id ) ;

We get the error:

psql:create_constraints.sql:158: ERROR:  could not create unique index "xpk_visit_cost"
DETAIL:  Key (visit_cost_id)=(147820) is duplicated.

Here is the duplicate data in visit_cost:

truven=> select * from visit_cost where visit_cost_id=147820;
 visit_cost_id | visit_occurrence_id | currency_concept_id | paid_copay | paid_coinsurance | paid_toward_deductible | paid_by_payer | paid_by_coordination_benefits | total_out_of_pocket |     total_paid     | payer_plan_period_id
---------------+---------------------+---------------------+------------+------------------+------------------------+---------------+-------------------------------+---------------------+--------------------+----------------------
        147820 |            11640952 |            44818668 |        0.0 |              0.0 |                    0.0 |      49914.18 |                           0.0 |                 0.0 | 50080.660000000003 |               367458
        147820 |            11640952 |            44818668 |        0.0 |              0.0 |                    0.0 |      49914.18 |                           0.0 |                 0.0 | 50080.660000000003 |               367458
(2 rows)

Note, some visit_cost records are duplicated as many as 25 times.

The ETL was interrupted under Windows after transforming the data but before creating indices and so forth that we don't need. Is there a step in the ETL that drops duplicate entries in visit_cost? All of the other tables had no problems with creating constraints.

Should it be safe to just drop all but one copy of the duplicate entries?

What we would give for a direct Linux PostgreSQL implementation of this ETL!

Thanks! Christophe

ericaVoss commented 7 years ago

@ChristopheLambert we see it on our side too . . . they do look like duplicates to me and @clairblacketer . Thank you for logging this I'll let my team know.

Hahaha . . . . sorry we try to make CDM_BUILDER friendly to others outside of JNJ but only so much we can do. :(

ChristopheLambert commented 7 years ago

Thanks, Erica. Here is some more weirdness:

After running the ETL on Truven MDCR, there are over 409M observation records where the observation_concept_id=900000010 and the observation_type_concept_id=900000009, neither of which are legitimate codes in the concept vocabulary. All of them have the observation_source_value of MHSACOVG, see some of them below:

 observation_id | person_id | observation_concept_id | observation_date | observation_time | observation_type_concept_id | value_as_number | value_as_string | value_as_concept_id | qualifier_concept_id | unit_concept_id | provider_id | visit_occurrence_id | observation_source_value | observation_source_concept_id | unit_source_value | qualifier_source_value
----------------+-----------+------------------------+------------------+------------------+-----------------------------+-----------------+-----------------+---------------------+----------------------+-----------------+-------------+---------------------+--------------------------+-------------------------------+-------------------+------------------------
          36190 |    144104 |              900000010 | 2006-01-31       |                  |                   900000009 |             1.0 |                 |                   0 |                    0 |               0 |             |                     | MHSACOVG                 |                             0 |                   |
          36191 |    144104 |              900000010 | 2006-02-28       |                  |                   900000009 |             1.0 |                 |                   0 |                    0 |               0 |             |                     | MHSACOVG                 |                             0 |                   |
          36192 |    144104 |              900000010 | 2006-03-31       |                  |                   900000009 |             1.0 |                 |                   0 |                    0 |               0 |             |                     | MHSACOVG                 |                             0 |                   |

Also in the visit_occurrence records, there are provider_id values set to zero -- missing values should be set to NULL. Similarly, care_site should not have 0 values for location_id -- just NULL. These cause problems with creating primary keys and constraints.

ericaVoss commented 7 years ago

Yeah, the document talks about it may not be 100% clear.
https://github.com/OHDSI/ETL-CDMBuilder/blob/master/man/TRUVEN_CCAE_MDCR/Truven_CCAE_and_MDCR_ETL_CDM_V5_V14.01.doc

_Additionally we add a Mental Health and Substance Abuse flags to the OBSERVATION table. In Truven in the ENROLLMENT_DETAIL table there is information about when an individual will have this type of coverage. We will add a 1 to VALUE_ASNUMBER when the coverage exists and else a 0. Make sure to take distinct records per ENROLID, DTEND, MHSACOVG.

Sometimes I beg Christian for new CONCEPT_IDs however I don't think I chased him down on this one. I'm not sure if anyone else would need this but if you think they would I can ask him for the CONCEPT_IDs and make this more formal.

INSERT INTO CONCEPT (CONCEPT_ID, CONCEPT_NAME, DOMAIN_ID, VOCABULARY_ID, CONCEPT_CLASS_ID, STANDARD_CONCEPT, CONCEPT_CODE, VALID_START_DATE, VALID_END_DATE, INVALID_REASON)
VALUES(900000009,'Mental Health Substance Abuse Coverage Indicator','Type Concept','Observation Type','Observation Type','S','OMOP generated','01/06/2013','12/31/2099','')
INSERT INTO CONCEPT (CONCEPT_ID, CONCEPT_NAME, DOMAIN_ID, VOCABULARY_ID, CONCEPT_CLASS_ID, STANDARD_CONCEPT, CONCEPT_CODE, VALID_START_DATE, VALID_END_DATE, INVALID_REASON)
VALUES(900000010,'Mental Health Substance Abuse Coverage Indicator','Observation','OMOP generated','Coverage Indicator','S','MHSACOVG','01/06/2013','12/31/2099','')

Thank you for column checking. I'll let my guys know.

ChristopheLambert commented 7 years ago

Thanks, Erica.

Truven said they started the MHSACOVG indicator in 2004, but that they retroactively added it to 2002 and 2003 data. Should the VALID_START_DATE for these concepts thus be at least 01/01/2004, if not 01/01/2002, instead of 01/06/2013?

Do you think it is safe to just drop the duplicates in visit_cost?

Kind regards, Christophe

ericaVoss commented 7 years ago

Technically that might be more correct . . . but I just give any concept I add my son's birthday. :) Unless a Vocab is date sensitive (e.g. DRG, NDC), it shouldn't matter.

Yes, it is safe to drop the duplicates in VISIT_COST.