callahantiff / PheKnowVec

Translational Computational Phenotyping
2 stars 0 forks source link

OTHER: Verify query to map source vocabulary code strings to an OMOP standard terminology concept code #29

Closed callahantiff closed 5 years ago

callahantiff commented 5 years ago

@mgkahn - Can you please verify the following query is correct?

Purpose: This query is designed to find exact matches between source vocabulary code strings (e.g. ICD9CM 428) and an OMOP standard terminology concept code.


SELECT DISTINCT
c.concept_code AS source_code,
c.concept_name AS source_name,
c.domain_id AS source_domain_id,
c.vocabulary_id AS source_vocabulary,
c1.concept_code AS standard_code,
c1.concept_name AS standard_name,
c1.domain_id AS standard_domain_id,
c1.vocabulary_id AS standard_vocabulary,
c1.standard_concept, c1.invalid_reason

FROM CHCO_DeID_Oct2018.concept c
JOIN CHCO_DeID_Oct2018.concept_relationship r
ON r.concept_id_1 = c.concept_id
JOIN CHCO_DeID_Oct2018.concept c1
ON r.concept_id_2 = c1.concept_id

WHERE c.concept_code IN ("314.01", "296")
AND c.vocabulary_id = "ICD9CM"
AND r.relationship_id = "Maps to" 
AND c1.vocabulary_id = "SNOMED"
AND c1.standard_concept = "S"
AND c1.invalid_reason= "";

Result:

source_code source_name source_domain_id source_vocabulary standard_code standard_name standard_domain_id standard_vocabulary standard_concept invalid_reason
296 Episodic mood disorders Condition ICD9CM 13746004 Bipolar disorder Condition SNOMED S  
314.01 Attention deficit disorder with hyperactivity Condition ICD9CM 406506008 Attention deficit hyperactivity disorder Condition SNOMED S  
mgkahn commented 5 years ago

The query above is correct for ETL. It may not be what you want to use to obtain standard concept_ids from ICD9CM codes in a query. In particular, on the query side, the very last clause in your WHERE statement: c1.invalid_reason = "" will return only those standard concept_ids that are valid on the day the query is run. But, if you are querying OMOP data that is ETL'd incrementally over time rather than as a full-dump (HDC uses incremental ETLs), then there will be data from older ETL cycles that are mapped to standard concept_ids that were valid then but are invalid now. The SQL above will not return currently invalid concept_ids.

In your example, removing the last clause did NOT change the results. But with medications, which change frequently, I suspect the results would vary over time/ETL cycles.

callahantiff commented 5 years ago

The query above is correct for ETL. It may not be what you want to use to obtain standard concept_ids from ICD9CM codes in a query. In particular, on the query side, the very last clause in your WHERE statement: c1.invalid_reason = "" will return only those standard concept_ids that are valid on the day the query is run. But, if you are querying OMOP data that is ETL'd incrementally over time rather than as a full-dump (HDC uses incremental ETLs), then there will be data from older ETL cycles that are mapped to standard concept_ids that were valid then but are invalid now. The SQL above will not return currently invalid concept_ids.

In your example, removing the last clause did NOT change the results. But with medications, which change frequently, I suspect the results would vary over time/ETL cycles.


Response:

Updated query shown for consistency:

SELECT DISTINCT
c.concept_code AS source_code,
c.concept_name AS source_name,
c.domain_id AS source_domain_id,
c.vocabulary_id AS source_vocabulary,
c1.concept_code AS standard_code,
c1.concept_name AS standard_name,
c1.domain_id AS standard_domain_id,
c1.vocabulary_id AS standard_vocabulary,
c1.standard_concept, c1.invalid_reason

FROM CHCO_DeID_Oct2018.concept c
JOIN CHCO_DeID_Oct2018.concept_relationship r
ON r.concept_id_1 = c.concept_id
JOIN CHCO_DeID_Oct2018.concept c1
ON r.concept_id_2 = c1.concept_id

WHERE c.concept_code IN ("314.01", "296")
AND c.vocabulary_id = "ICD9CM"
AND r.relationship_id = "Maps to" 
AND c1.vocabulary_id = "SNOMED"
AND c1.standard_concept = "S";
mgkahn commented 5 years ago

Just highlighting that a query for ETL may not be what you want for a query. For ETL, the restriction to only valid concepts makes sense. But for a query against longitudinal data that is added incrementally, you need to consider how to handle deprecated concept_ids that will persist in the data warehouse. That's the point I was making. I would include the restriction for ETL processing and not include the restriction for queries.

callahantiff commented 5 years ago

OK, thank you for explaining that! I think we are good on this so I will close the issue.