OHDSI / Vocabulary-v5.0

Build process for the OHDSI Standardized Vocabularies. Currently not available as independent release.
The Unlicense
206 stars 73 forks source link

NDC codes reuse #81

Open ericaVoss opened 8 years ago

ericaVoss commented 8 years ago

Looking NDC 00069551066-cetirizine hydrochloride 10 MG Oral Tablet [Zyrtec] in the 20160311 VOCAB we have:

Looking in raw claims data, it looks like this NDC was used earlier. Here is the row counts by year in one of our large claims datasets:

2000    185381
2001    343540
2002    688628
2003    933387
2004    1108531
2005    1227832
2006    1277700
2007    80386
2008    1618
2009    33
2010    4

Addition, the net refers to this NDC earlier than 2010 - here is an example that calls this NDC as Zyrtec in 2005. http://www.mainecarepdl.org/sites/default/files/ghs-files/manufacturer-rebate/2005-05-19/mainemomanufacturerq105.pdf

I know we are getting these dates as feeds, but thought I would bring it up as an example just in case something was going on.

cgreich commented 8 years ago

@ericaVoss:

Yes, these valid dates are what we receive from our sources (mostly RxNorm API). You need them only if there are two identical codes with different validity times (reused codes). Otherwise, ignore them. I don't think there is a way to get them precise.

ericaVoss commented 8 years ago

I tried to implement this logic on top of the Vocab but figured out that there isn't one NDC in the OMOP Vocab (20160311) that is duplicated.

SELECT CONCEPT_CODE, COUNT(*)
FROM CONCEPT
WHERE VOCABULARY_ID IN ('NDC')
GROUP BY CONCEPT_CODE
HAVING COUNT(*) > 1

I feel between a rock and hard place here . . .

The only idea I have is to ignore the dates unless there are specific examples we find where we should follow the dates. This is a very manual approach but don't have any better ideas.

pbr6cornell commented 8 years ago

Could you write a query against the vocabulary to use all source-to-concept mappings without dates in the instances where the source code does not have multiple valid timestamps, and then use the timestamps in those instances where a code has multiple mappings with different valid periods?

On Tue, Jun 14, 2016 at 9:52 PM, ericaVoss notifications@github.com wrote:

I tried to implement this logic on top of the Vocab but figured out that there isn't one NDC in the OMOP Vocab (20160311) that is duplicated.

SELECT CONCEPTCODE, COUNT()FROM CONCEPTWHERE VOCABULARY_ID IN ('NDC')GROUP BY CONCEPTCODEHAVING COUNT() > 1

I feel between a rock and hard place here . . .

  • If we use the date - we are losing some potential good mappings (i.e. 00069551066-cetirizine hydrochloride 10 MG Oral Tablet [Zyrtec]). My last CCAE not using the dates had 6% unmapped NDCs and now I'm up to 11% unmapped.
  • If we don't use the date we will make some poor maps - (i.e. the old 000060005-suvorexant http://forums.ohdsi.org/t/ndcs-changing-over-time/903 example)

The only idea I have is to ignore the dates unless there are specific examples we find where we should follow the dates. This is a very manual approach but don't have any better ideas.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OHDSI/Vocabulary-v5.0/issues/81#issuecomment-226067303, or mute the thread https://github.com/notifications/unsubscribe/AAsrGu-OPOtq89Rxeg2f8ijLxp6MW2sZks5qL1r0gaJpZM4I04Gn .

ericaVoss commented 8 years ago

@pbr6cornell - unless I'm not clear what you are describing - this is what I wanted to do but when I went to write the query I figured out there isn't one NDC code in the OMOP Vocabulary with multiple valid periods.

cgreich commented 8 years ago

Friends:

  1. Please provide the unmapped NDC codes.
  2. Yes, we kicked them out. Please provide the "poor maps", and we try to fix them.
ericaVoss commented 8 years ago

I asked our largest claims database, what NDCs:

An additional requirement we could add is that the years need to cross into the valid years given by the CONCEPT table. For example 00002312542-Vancomycin 125 MG Oral Capsule [Vancocin] in my large claims database spans from 2000 to 2010 but the VALID_START_DATE=2007-06-01 and VALID_END_DATE=2011-02-01. What do you think of this idea?

I gathered additional information on the DRUG_SOURCE_VALUE like what the min/max year in the data and what are the min/max VALID_START/END_DATE from CONCEPT. I get 12,437 NDCs. See attached file.

NDC codes to potentially change the VALID dates on (v01).xlsx

If you think these look like good NDCs to extend the dates, would we just lower the VALID_START_DATE to something like '1900-01-01' and up the VALID_END_DATE to '12/31/2099'. If you agree - I want to patch our 20160311.

Here is the SQL ran:

WITH CTE_MISSING_DRUGS AS (
    SELECT DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE) AS [YEAR], 
COUNT(*) AS ROW_COUNT, ROW_NUMBER() OVER(PARTITION BY DRUG_SOURCE_VALUE ORDER BY DRUG_SOURCE_VALUE) AS ROW_NUM
    FROM DRUG_EXPOSURE
    WHERE DRUG_CONCEPT_ID = 0
    AND DRUG_SOURCE_VALUE NOT IN ('00000000000','99999999999')
    GROUP BY DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE)
), 
CTE_ENOUGH_YEARS_OUTSIDE AS (
    SELECT DISTINCT DRUG_SOURCE_VALUE
    FROM CTE_MISSING_DRUGS
    WHERE ROW_NUM > 4
), 
CTE_ENOUGH_DATA AS (
    SELECT DISTINCT DRUG_SOURCE_VALUE
    FROM CTE_MISSING_DRUGS
    WHERE ROW_COUNT > 100
)
SELECT md.DRUG_SOURCE_VALUE, MIN(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MIN_YEAR, MAX(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MAX_YEAR, AVG(ROW_COUNT) AS AVG_ROW_COUNT, 
    MAX(CONCEPT_NAME) AS CONCEPT_NAME, MIN(VALID_START_DATE) AS MIN_VALID_START_DATE, MAX(VALID_END_DATE) AS MAX_VALID_END_DATE
FROM CTE_MISSING_DRUGS md
    JOIN CONCEPT c
        ON c.CONCEPT_CODE = md.DRUG_SOURCE_VALUE
        AND c.VOCABULARY_ID IN ('NDC')
    JOIN DRUG_EXPOSURE de
        ON de.DRUG_SOURCE_VALUE = md.DRUG_SOURCE_VALUE
WHERE md.DRUG_SOURCE_VALUE IN (
    SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_YEARS_OUTSIDE
)
AND md.DRUG_SOURCE_VALUE IN (
    SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_DATA
)
GROUP BY md.DRUG_SOURCE_VALUE
ORDER BY md.DRUG_SOURCE_VALUE
ericaVoss commented 8 years ago

@cgreich or @pbr6cornell wondering if you had any thoughts on this approach.

pbr6cornell commented 8 years ago

@ericaVoss, I think your proposal sounds very reasonable, extending the valid period for drugs which seem in use continuously before or after the valid period to subsume the entire period of continuous use. It seems you could set the VALID_START_DATE to the minimum date of your period of continuous use (which could be first of year if you did your analysis at the year level) and VALID_END_DATE could be set to the maximum date of your period of continuous use - and set to 12/31/2099 if the period of continuous use goes all the way through the end.

On Fri, Jul 1, 2016 at 1:04 PM, ericaVoss notifications@github.com wrote:

@cgreich https://github.com/cgreich or @pbr6cornell https://github.com/pbr6cornell wondering if you had any thoughts on this approach.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/OHDSI/Vocabulary-v5.0/issues/81#issuecomment-229998419, or mute the thread https://github.com/notifications/unsubscribe/AAsrGnEr3hYLeTyjKfCxJsdc4jFNvUZqks5qRUimgaJpZM4I04Gn .

ericaVoss commented 8 years ago

@cgreich I assume this would get implemented on your side and we would consume this update through the Vocab. Maybe there is a way to make this repeatable?

cgreich commented 8 years ago

Absolutely. You'd have to give us the list (or the code, so I could have it run at IMS as well).

ericaVoss commented 8 years ago

Sure, here is what I ran on my side.

/*Find all NDCs in a raw data source*/
/*got mapped to a DRUG_CONCEPT_ID = 0*/
/*had more than 4 years getting mapped to 0*/
/*had at least 100 rows in one of those years mapped to 0*/
/*Noe - this requires a CDM be run first - however this could be reprogrammed to run off a raw data sources too*/
WITH CTE_MISSING_DRUGS AS (
    SELECT DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE) AS [YEAR], 
COUNT(*) AS ROW_COUNT, ROW_NUMBER() OVER(PARTITION BY DRUG_SOURCE_VALUE ORDER BY DRUG_SOURCE_VALUE) AS ROW_NUM
    FROM DRUG_EXPOSURE
    WHERE DRUG_CONCEPT_ID = 0
    AND DRUG_SOURCE_VALUE NOT IN ('00000000000','99999999999')
    GROUP BY DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE)
), 
CTE_ENOUGH_YEARS_OUTSIDE AS (
    SELECT DISTINCT DRUG_SOURCE_VALUE
    FROM CTE_MISSING_DRUGS
    WHERE ROW_NUM > 4
), 
CTE_ENOUGH_DATA AS (
    SELECT DISTINCT DRUG_SOURCE_VALUE
    FROM CTE_MISSING_DRUGS
    WHERE ROW_COUNT > 100
)
SELECT md.DRUG_SOURCE_VALUE, MIN(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MIN_YEAR, MAX(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MAX_YEAR, AVG(ROW_COUNT) AS AVG_ROW_COUNT, 
    MAX(CONCEPT_NAME) AS CONCEPT_NAME, MIN(VALID_START_DATE) AS MIN_VALID_START_DATE, MAX(VALID_END_DATE) AS MAX_VALID_END_DATE
INTO NDC_PATCH
FROM CTE_MISSING_DRUGS md
    JOIN CONCEPT c
        ON c.CONCEPT_CODE = md.DRUG_SOURCE_VALUE
        AND c.VOCABULARY_ID IN ('NDC')
    JOIN DRUG_EXPOSURE de
        ON de.DRUG_SOURCE_VALUE = md.DRUG_SOURCE_VALUE
WHERE md.DRUG_SOURCE_VALUE IN (
    SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_YEARS_OUTSIDE
)
AND md.DRUG_SOURCE_VALUE IN (
    SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_DATA
)
GROUP BY md.DRUG_SOURCE_VALUE
ORDER BY md.DRUG_SOURCE_VALUE;

/*Once we find the NDCs with possible bad VALID_START/END_DATE, select better dates based on data*/
WITH CTE_DATA_YEARS AS (
SELECT p.DRUG_SOURCE_VALUE, p.CONCEPT_NAME, 
    CAST(('01/01/'+ CAST(MIN_YEAR AS char)) AS DATE) AS DATA_MIN_VALID_START_DATE,
    CAST(('01/01/'+ CAST(MAX_YEAR AS char)) AS DATE) AS DATA_MAX_VALID_END_DATE
FROM NDC_PATCH p
)
SELECT p.DRUG_SOURCE_VALUE, p.CONCEPT_NAME,
    CASE WHEN p.DATA_MIN_VALID_START_DATE < nv.valid_start_date THEN p.DATA_MIN_VALID_START_DATE ELSE nv.valid_start_date END AS NEW_VALID_START_DATE, 
    CASE WHEN p.DATA_MAX_VALID_END_DATE > nv.valid_END_date THEN p.DATA_MAX_VALID_END_DATE ELSE nv.valid_END_date END AS NEW_VALID_END_DATE
INTO NDC_PATCH_DATES_SELECTED
FROM CTE_DATA_YEARS p
    JOIN Vocabulary.dbo.CONCEPT nv
        ON nv.CONCEPT_CODE = p.DRUG_SOURCE_VALUE

/*Update the Vocab*/
UPDATE Vocabulary.dbo.CONCEPT
SET VALID_START_DATE = s.NEW_VALID_START_DATE, VALID_END_DATE = s.NEW_VALID_END_DATE
--SELECT DRUG_SOURCE_VALUE, VALID_START_DATE, NEW_VALID_START_DATE, VALID_END_DATE, NEW_VALID_END_DATE
FROM NDC_PATCH_DATES_SELECTED s
    JOIN Vocabulary.dbo.CONCEPT nv
        ON nv.concept_code = s.DRUG_SOURCE_VALUE
        AND nv.VOCABULARY_ID = 'NDC'

--12437 rows were updated using this approach

Attached are the examples of what was updated. Updated VALID dates in Vocab NDCs.xlsx

ericaVoss commented 7 years ago

@cgreich was anything like this incorporated into 20160817 or do I need to do these updates again on my side?

cgreich commented 7 years ago

Haven't gotten to it, yet.

ericaVoss commented 7 years ago

Writing a note for myself - these are some top NDCs that didn't get mapped, I might need to revisit my date pushing out logic to see if I was too strict:

NDC ROW_COUNT
00300304613 2080773
00054327099 1975690
53885024510 1270142
00045152550 1181879
00088221905 1169998
00046086781 1159599
00085128801 1148120
ericaVoss commented 6 years ago

@pbr6cornell recently performed an analysis that let me know my date pushing out logic was not pushing out the drug dates in all cases. I have re-written my logic to leverage a large claims database to inform the INVALID_START_DATE/INVALID_END_DATEs out.

WITH CTE_DRUGS_WITH_EXPOSURES_OUTSIDE AS (
    /*Use source and vocab to find NDCs of interest*/
    SELECT DISTINCT de.DRUG_SOURCE_VALUE, c.VALID_START_DATE, c.VALID_END_DATE
    FROM CDM.dbo.DRUG_EXPOSURE de
        JOIN VOCABULARY_20170503.dbo.CONCEPT c
            ON c.CONCEPT_CODE = de.DRUG_SOURCE_VALUE
            AND c.VOCABULARY_ID = 'NDC'
            AND (
                de.DRUG_EXPOSURE_START_DATE < c.VALID_START_DATE
                OR de.DRUG_EXPOSURE_START_DATE > c.VALID_END_DATE
            )
),
cteDrugTarget AS (
    /*Leveraging DRUG_ERA logic - https://gist.github.com/chrisknoll/a18c8e15ff66f26fac84 */
    SELECT de.DRUG_SOURCE_VALUE AS DRUG_SOURCE_VALUE, 
        DATEFROMPARTS(YEAR(de.DRUG_EXPOSURE_START_DATE),1,1) AS DRUG_EXPOSURE_START_DATE,
        DATEFROMPARTS(YEAR(de.DRUG_EXPOSURE_START_DATE),12,31) AS DRUG_EXPOSURE_END_DATE,
        COUNT(*) AS ROW_COUNT
    FROM CDM.dbo.DRUG_EXPOSURE de
        JOIN VOCABULARY_20170503.dbo.CONCEPT c
            ON c.CONCEPT_CODE = de.DRUG_SOURCE_VALUE
            AND c.VOCABULARY_ID = 'NDC'
    WHERE de.DRUG_SOURCE_VALUE IN (
        SELECT DRUG_SOURCE_VALUE FROM CTE_DRUGS_WITH_EXPOSURES_OUTSIDE
    )
    GROUP BY de.DRUG_SOURCE_VALUE, YEAR(de.DRUG_EXPOSURE_START_DATE)
    HAVING COUNT(*) > 100 /*Don't take years where exposure small*/
),
cteEndDates as (
    /*DRUG_ERA LOGIC*/
    select DRUG_SOURCE_VALUE, DATEADD(day,-1,EVENT_DATE) as END_DATE -- unpad the end date
    FROM
    (
        select DRUG_SOURCE_VALUE, EVENT_DATE, EVENT_TYPE, 
        MAX(START_ORDINAL) OVER (PARTITION BY DRUG_SOURCE_VALUE  ORDER BY EVENT_DATE, EVENT_TYPE ROWS UNBOUNDED PRECEDING) as START_ORDINAL, -- this pulls the current START down from the prior rows so that the NULLs from the END DATES will contain a value we can compare with 
        ROW_NUMBER() OVER (PARTITION BY DRUG_SOURCE_VALUE ORDER BY EVENT_DATE, EVENT_TYPE) AS OVERALL_ORD -- this re-numbers the inner UNION so all rows are numbered ordered by the event date
        from (
            -- select the start dates, assigning a row number to each
            Select DRUG_SOURCE_VALUE,  DRUG_EXPOSURE_START_DATE AS EVENT_DATE, -1 as EVENT_TYPE, ROW_NUMBER() OVER (PARTITION BY DRUG_SOURCE_VALUE ORDER BY DRUG_EXPOSURE_START_DATE) as START_ORDINAL
            from cteDrugTarget
            UNION ALL
            -- pad the end dates by 1 to allow a grace period for overlapping ranges.
            select DRUG_SOURCE_VALUE, DATEADD(day,1,DRUG_EXPOSURE_END_DATE), 1 as EVENT_TYPE, NULL
            FROM cteDrugTarget
        ) RAWDATA
    ) E
    WHERE (2 * E.START_ORDINAL) - E.OVERALL_ORD = 0
),
cteDrugExposureEnds  as (
    select d.DRUG_SOURCE_VALUE, d.DRUG_EXPOSURE_START_DATE, 
        MIN(e.END_DATE) as DRUG_ERA_END_DATE
    FROM cteDrugTarget d
        JOIN cteEndDates e 
            on d.DRUG_SOURCE_VALUE = e.DRUG_SOURCE_VALUE 
            and e.END_DATE >= d.DRUG_EXPOSURE_START_DATE
    GROUP BY d.DRUG_SOURCE_VALUE, 
        d.DRUG_EXPOSURE_START_DATE
),
CTE_ERAS AS (
    select DRUG_SOURCE_VALUE,  min(DRUG_EXPOSURE_START_DATE) as DRUG_ERA_START_DATE, DRUG_ERA_END_DATE, COUNT(*) as DRUG_EXPOSURE_COUNT
    from cteDrugExposureEnds d
    GROUP BY DRUG_SOURCE_VALUE,  DRUG_ERA_END_DATE
)
SELECT e.DRUG_SOURCE_VALUE, 
    CASE WHEN DRUG_ERA_START_DATE <= c.VALID_START_DATE THEN DRUG_ERA_START_DATE ELSE c.VALID_START_DATE END AS NEW_VALID_START_DATE,
    CASE WHEN DRUG_ERA_END_DATE >= c.VALID_END_DATE THEN DRUG_ERA_END_DATE ELSE c.VALID_END_DATE END AS NEW_VALID_END_DATE,
    c.VALID_START_DATE AS OLD_VALID_START_DATE,c.VALID_END_DATE AS OLD_VALID_END_DATE,
    e.DRUG_ERA_START_DATE AS CALC_DRUG_ERA_START_DATE, e.DRUG_ERA_END_DATE AS CALC_DRUG_ERA_END_DATE
FROM CTE_ERAS e
    JOIN VOCABULARY_20170503.dbo.CONCEPT c
        ON c.CONCEPT_CODE = e.DRUG_SOURCE_VALUE
        AND YEAR(c.VALID_START_DATE) BETWEEN YEAR(e.DRUG_ERA_START_DATE) AND YEAR(e.DRUG_ERA_END_DATE) /*if there are two eras at least take the one associated with what is in the Vocab*/
cgreich commented 6 years ago

@ericaVoss, friends:

Before you paste even longer SQL code I will not find will power to reverse engineer let me ask the following question:

This whole debate resulted from us having suspicion of reuse of NDC codes, correct? Because if that weren't the case we could just ignore the start and end_dates. If no, explain. If yes, I believe in THEMIS we decided to start a list of reused NDCs and publish it. Olivier has one, we can add to it.

Help my memory please.

ericaVoss commented 6 years ago

Yes, this started due to the reuse of NDC codes. I'm fine with flipping this and tracking the reused NDCs. What if in the Vocabulary we only set the valid dates for the reused ones and then for the non-reused ones we have really wide dates (1900-2099).

cgreich commented 6 years ago

@ericaVoss:

Because we might want to study how long the pipeline keeps drugs going. And what the variation is, depending on what drug. And whether label changes make a dent into that.

cgreich commented 6 years ago

@aostropolets: Can you reach out to Olivier and his list? We should put that somewhere into the vocab documentation, and a Forum post pointing to it.

ericaVoss commented 6 years ago

@cgreich but I don't think the dates in the Vocabulary for NDCs are good anyway. They seem to be way tighter than they should be.

cgreich commented 6 years ago

@ericaVoss: Why are you saying that?

ericaVoss commented 6 years ago

@cgreich I was seeing a bunch that were outside of the dates, here are some examples from Vocabulary 20171201

NDC DATA_START_DATE VOCAB VALID_START_DATE
52544017572 2007-01-01 2012-12-01
49884059201 2001-01-01 2009-10-01
00039006011 2006-01-01 2007-06-01

What my code does it build eras of exposure of the drug, as long as there is >100 persons exposed year by year I allow the dates to move out.

cgreich commented 6 years ago

@ericaVoss:

Which is brilliant. We should do a network study. The result should go to the NLM and Olivier. Like in "What the heck is going on with these? They are not part of your reuse list."

ericaVoss commented 6 years ago

@cgreich - I know I'm dragging my feet there. Mui, Vojetech, and @dimshitc have been on me!

https://github.com/OHDSI/StudyProtocolSandbox/tree/master/NDCReview

cgreich commented 6 years ago

@ericaVoss:

Perfect. Don't have to wait. We can get started and inundate Olivier. Who is on it? @dimshitc probably.

ericaVoss commented 6 years ago

So far I had: @ericaVoss @dimshitc @aostropolets Vojtech @cgreich

dimshitc commented 4 months ago

@TinyRickC137 What if we start tracking this at least? here's the query so you can produce the list of concept changed their meaning each release. and track as long as possible in the past. Then the look up can be sent to the users. It might be better than changing overall vocabulary rules as I suggested.

Note, the query might return some false-positive results, so some manual review is needed. I compared the 2017 (the oldest we have) and Jan-2023 version, and got around 900 of concepts, while around 100 are a real reuse. also I can share this table, so someone can review it and we'll have a nice list of reused NDC at least from 2017