OHDSI / Vocabulary-v5.0

Build process for the OHDSI Standardized Vocabularies. Currently not available as independent release.
The Unlicense
223 stars 75 forks source link

From Cristyn: Records are missing from concept_ancestor #17

Closed cgreich closed 9 years ago

cgreich commented 9 years ago

Hi Christian, fyi, I'm finding that records are missing from concept_ancestor. For example: $ grep "^3800361[4-6]," CONCEPT.csv 38003614,European,Race,Race,Race,S,5.01.,19700101,20991231, 38003615,Middle Eastern or North African,Race,Race,Race,S,5.02.,19700101,20991231, 38003616,Arab,Race,Race,Race,S,5.03.,19700101,20991231, $ grep ",3800361[4-6]," CONCEPT_RELATIONSHIP.csv | grep Sub 8527,38003614,Subsumes,19800101,20991231, 8527,38003616,Subsumes,19800101,20991231, 8527,38003615,Subsumes,19800101,20991231, $ grep "3800361[4-6]" CONCEPT_ANCESTOR.csv | sort 38003614,38003614,0,0 38003615,38003615,0,0 8527,38003614,1,1 8527,38003615,1,1 Why is standard concept 38003616 missing from CONCEPT_ANCESTOR.csv?

ericaVoss commented 9 years ago

Let me add to this:

FYI, this was run on VOCAB v5

/*THIS CONCEPT_ID GOES NO WHERE IN CONCEPT_ANCESTOR*/
SELECT *
FROM CONCEPT c
    JOIN CONCEPT_ANCESTOR ca
        ON ca.ancestor_concept_id = c.concept_id
    JOIN CONCEPT c2
        ON c2.concept_id = ca.descendant_concept_id
WHERE c.CONCEPT_ID = 1000619 /*Ondansetron 16 MG Oral Tablet*/

/*I CAN MAKE TWO HOPS WITH CONCEPT_REALTIONSHIP TO GET TO AN INGREDIENT*/
SELECT c.CONCEPT_ID AS START_POINT_ID, c.CONCEPT_NAME AS START_POINT_NAME, 
    c2.CONCEPT_ID AS HOP1_ID, c2.CONCEPT_NAME AS HOP1_NAME, 
    c3.CONCEPT_ID AS HOP2_ID, c3.CONCEPT_NAME AS HOP2_NAME
FROM CONCEPT c
    JOIN concept_relationship r
        ON r.concept_id_1 = c.CONCEPT_ID
    JOIN CONCEPT c2
        ON c2.CONCEPT_ID = r.CONCEPT_ID_2
        AND c2.vocabulary_id = 'RxNorm'
    JOIN concept_relationship r2
        ON r2.CONCEPT_ID_1 = c2.CONCEPT_ID
    JOIN CONCEPT c3
        ON c3.CONCEPT_ID = r2.CONCEPT_ID_2
        AND c3.vocabulary_id = 'RxNorm'
        AND c3.CONCEPT_CLASS_ID = 'Ingredient'
WHERE c.CONCEPT_ID = 1000619 /*Ondansetron 16 MG Oral Tablet*/

/*WHERE THIS INGREDIENT DOES GO TO IN CONCEPT_ANCESTOR, JUST DOESN"T INCLUDE OUR CONCEPT*/
SELECT c.CONCEPT_ID, c.CONCEPT_NAME, c2.*
FROM CONCEPT c
    JOIN concept_ancestor ca
        ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
    JOIN CONCEPT c2
        ON c2.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID
WHERE c.CONCEPT_ID = 1000560 /*Ondansetron*/
ORDER BY c2.CONCEPT_ID
ericaVoss commented 9 years ago

Took a larger approach to this:

Of the 160,313 valid RxNorm concepts, only 25% of them have a CONCEPT_ANCESTOR mapping and of those only 3% get mapped to an ingredient.

WITH CTE_SUMMARY AS (
    SELECT DISTINCT c.CONCEPT_ID, c.CONCEPT_NAME, 
        CASE WHEN ca.DESCENDANT_CONCEPT_ID IS NULL THEN 0 ELSE 1 END AS ANCESTOR_EXISTS,
        CASE WHEN c2.CONCEPT_ID IS NULL THEN 0 ELSE 1 END AS INGREDIENT_MAP_EXISTS  
    FROM CONCEPT c
        LEFT OUTER JOIN concept_ancestor ca
            ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
        LEFT OUTER JOIN CONCEPT c2
            ON c2.concept_id = ca.DESCENDANT_CONCEPT_ID
            AND c2.VOCABULARY_ID = 'RxNorm'
            AND c2.CONCEPT_CLASS_ID = 'Ingredient'
            AND c2.INVALID_REASON IS NULL
    WHERE c.VOCABULARY_ID = 'RxNorm'
    AND c.invalid_reason IS NULL
), 
CTE_ROLL_UP AS (
    SELECT CONCEPT_ID, CONCEPT_NAME, MAX(ANCESTOR_EXISTS) AS ANCESTOR_EXISTS, MAX(INGREDIENT_MAP_EXISTS) AS INGREDIENT_MAP_EXISTS  
    FROM CTE_SUMMARY
    GROUP BY CONCEPT_ID, CONCEPT_NAME
)
SELECT COUNT(*) AS RxNORM_CONCEPTS, 
    SUM(ANCESTOR_EXISTS) AS CONCEPTS_WITH_ANCESTORS, SUM(ANCESTOR_EXISTS)*1.0/(SELECT COUNT(*) FROM CTE_ROLL_UP) AS CONCEPTS_WITH_ANCESTORS_PCT,
    SUM(INGREDIENT_MAP_EXISTS) AS CONCEPTS_THAT_DRILL_TO_INGREDIENTS, SUM(INGREDIENT_MAP_EXISTS)*1.0/(SELECT COUNT(*) FROM CTE_ROLL_UP) AS INGREDIENT_MAP_EXISTS_PCT
FROM CTE_ROLL_UP

I compared this with the previous VOCAB (OMOP Vocabulary v4.4 2013-10-15):

Of the 157,626 valid RxNorm concepts, 100% had CONCEPT_ANCESTORS, and 6% of those get down to an ingredient.

cgreich commented 9 years ago

Fixed