Open ericaVoss opened 8 years ago
@ericaVoss:
Yes, these valid dates are what we receive from our sources (mostly RxNorm API). You need them only if there are two identical codes with different validity times (reused codes). Otherwise, ignore them. I don't think there is a way to get them precise.
I tried to implement this logic on top of the Vocab but figured out that there isn't one NDC in the OMOP Vocab (20160311) that is duplicated.
SELECT CONCEPT_CODE, COUNT(*)
FROM CONCEPT
WHERE VOCABULARY_ID IN ('NDC')
GROUP BY CONCEPT_CODE
HAVING COUNT(*) > 1
I feel between a rock and hard place here . . .
00069551066-cetirizine hydrochloride 10 MG Oral Tablet [Zyrtec]
). My last CCAE not using the dates had 6% unmapped NDCs and now I'm up to 11% unmapped.000060005-suvorexant
example)The only idea I have is to ignore the dates unless there are specific examples we find where we should follow the dates. This is a very manual approach but don't have any better ideas.
Could you write a query against the vocabulary to use all source-to-concept mappings without dates in the instances where the source code does not have multiple valid timestamps, and then use the timestamps in those instances where a code has multiple mappings with different valid periods?
On Tue, Jun 14, 2016 at 9:52 PM, ericaVoss notifications@github.com wrote:
I tried to implement this logic on top of the Vocab but figured out that there isn't one NDC in the OMOP Vocab (20160311) that is duplicated.
SELECT CONCEPTCODE, COUNT()FROM CONCEPTWHERE VOCABULARY_ID IN ('NDC')GROUP BY CONCEPTCODEHAVING COUNT() > 1
I feel between a rock and hard place here . . .
- If we use the date - we are losing some potential good mappings (i.e. 00069551066-cetirizine hydrochloride 10 MG Oral Tablet [Zyrtec]). My last CCAE not using the dates had 6% unmapped NDCs and now I'm up to 11% unmapped.
- If we don't use the date we will make some poor maps - (i.e. the old 000060005-suvorexant http://forums.ohdsi.org/t/ndcs-changing-over-time/903 example)
The only idea I have is to ignore the dates unless there are specific examples we find where we should follow the dates. This is a very manual approach but don't have any better ideas.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OHDSI/Vocabulary-v5.0/issues/81#issuecomment-226067303, or mute the thread https://github.com/notifications/unsubscribe/AAsrGu-OPOtq89Rxeg2f8ijLxp6MW2sZks5qL1r0gaJpZM4I04Gn .
@pbr6cornell - unless I'm not clear what you are describing - this is what I wanted to do but when I went to write the query I figured out there isn't one NDC code in the OMOP Vocabulary with multiple valid periods.
Friends:
I asked our largest claims database, what NDCs:
An additional requirement we could add is that the years need to cross into the valid years given by the CONCEPT table. For example 00002312542-Vancomycin 125 MG Oral Capsule [Vancocin]
in my large claims database spans from 2000 to 2010 but the VALID_START_DATE=2007-06-01 and VALID_END_DATE=2011-02-01. What do you think of this idea?
I gathered additional information on the DRUG_SOURCE_VALUE like what the min/max year in the data and what are the min/max VALID_START/END_DATE from CONCEPT. I get 12,437 NDCs. See attached file.
NDC codes to potentially change the VALID dates on (v01).xlsx
If you think these look like good NDCs to extend the dates, would we just lower the VALID_START_DATE to something like '1900-01-01' and up the VALID_END_DATE to '12/31/2099'. If you agree - I want to patch our 20160311.
Here is the SQL ran:
WITH CTE_MISSING_DRUGS AS (
SELECT DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE) AS [YEAR],
COUNT(*) AS ROW_COUNT, ROW_NUMBER() OVER(PARTITION BY DRUG_SOURCE_VALUE ORDER BY DRUG_SOURCE_VALUE) AS ROW_NUM
FROM DRUG_EXPOSURE
WHERE DRUG_CONCEPT_ID = 0
AND DRUG_SOURCE_VALUE NOT IN ('00000000000','99999999999')
GROUP BY DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE)
),
CTE_ENOUGH_YEARS_OUTSIDE AS (
SELECT DISTINCT DRUG_SOURCE_VALUE
FROM CTE_MISSING_DRUGS
WHERE ROW_NUM > 4
),
CTE_ENOUGH_DATA AS (
SELECT DISTINCT DRUG_SOURCE_VALUE
FROM CTE_MISSING_DRUGS
WHERE ROW_COUNT > 100
)
SELECT md.DRUG_SOURCE_VALUE, MIN(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MIN_YEAR, MAX(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MAX_YEAR, AVG(ROW_COUNT) AS AVG_ROW_COUNT,
MAX(CONCEPT_NAME) AS CONCEPT_NAME, MIN(VALID_START_DATE) AS MIN_VALID_START_DATE, MAX(VALID_END_DATE) AS MAX_VALID_END_DATE
FROM CTE_MISSING_DRUGS md
JOIN CONCEPT c
ON c.CONCEPT_CODE = md.DRUG_SOURCE_VALUE
AND c.VOCABULARY_ID IN ('NDC')
JOIN DRUG_EXPOSURE de
ON de.DRUG_SOURCE_VALUE = md.DRUG_SOURCE_VALUE
WHERE md.DRUG_SOURCE_VALUE IN (
SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_YEARS_OUTSIDE
)
AND md.DRUG_SOURCE_VALUE IN (
SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_DATA
)
GROUP BY md.DRUG_SOURCE_VALUE
ORDER BY md.DRUG_SOURCE_VALUE
@cgreich or @pbr6cornell wondering if you had any thoughts on this approach.
@ericaVoss, I think your proposal sounds very reasonable, extending the valid period for drugs which seem in use continuously before or after the valid period to subsume the entire period of continuous use. It seems you could set the VALID_START_DATE to the minimum date of your period of continuous use (which could be first of year if you did your analysis at the year level) and VALID_END_DATE could be set to the maximum date of your period of continuous use - and set to 12/31/2099 if the period of continuous use goes all the way through the end.
On Fri, Jul 1, 2016 at 1:04 PM, ericaVoss notifications@github.com wrote:
@cgreich https://github.com/cgreich or @pbr6cornell https://github.com/pbr6cornell wondering if you had any thoughts on this approach.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/OHDSI/Vocabulary-v5.0/issues/81#issuecomment-229998419, or mute the thread https://github.com/notifications/unsubscribe/AAsrGnEr3hYLeTyjKfCxJsdc4jFNvUZqks5qRUimgaJpZM4I04Gn .
@cgreich I assume this would get implemented on your side and we would consume this update through the Vocab. Maybe there is a way to make this repeatable?
Absolutely. You'd have to give us the list (or the code, so I could have it run at IMS as well).
Sure, here is what I ran on my side.
/*Find all NDCs in a raw data source*/
/*got mapped to a DRUG_CONCEPT_ID = 0*/
/*had more than 4 years getting mapped to 0*/
/*had at least 100 rows in one of those years mapped to 0*/
/*Noe - this requires a CDM be run first - however this could be reprogrammed to run off a raw data sources too*/
WITH CTE_MISSING_DRUGS AS (
SELECT DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE) AS [YEAR],
COUNT(*) AS ROW_COUNT, ROW_NUMBER() OVER(PARTITION BY DRUG_SOURCE_VALUE ORDER BY DRUG_SOURCE_VALUE) AS ROW_NUM
FROM DRUG_EXPOSURE
WHERE DRUG_CONCEPT_ID = 0
AND DRUG_SOURCE_VALUE NOT IN ('00000000000','99999999999')
GROUP BY DRUG_SOURCE_VALUE, YEAR(DRUG_EXPOSURE_START_DATE)
),
CTE_ENOUGH_YEARS_OUTSIDE AS (
SELECT DISTINCT DRUG_SOURCE_VALUE
FROM CTE_MISSING_DRUGS
WHERE ROW_NUM > 4
),
CTE_ENOUGH_DATA AS (
SELECT DISTINCT DRUG_SOURCE_VALUE
FROM CTE_MISSING_DRUGS
WHERE ROW_COUNT > 100
)
SELECT md.DRUG_SOURCE_VALUE, MIN(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MIN_YEAR, MAX(YEAR(de.DRUG_EXPOSURE_START_DATE)) AS MAX_YEAR, AVG(ROW_COUNT) AS AVG_ROW_COUNT,
MAX(CONCEPT_NAME) AS CONCEPT_NAME, MIN(VALID_START_DATE) AS MIN_VALID_START_DATE, MAX(VALID_END_DATE) AS MAX_VALID_END_DATE
INTO NDC_PATCH
FROM CTE_MISSING_DRUGS md
JOIN CONCEPT c
ON c.CONCEPT_CODE = md.DRUG_SOURCE_VALUE
AND c.VOCABULARY_ID IN ('NDC')
JOIN DRUG_EXPOSURE de
ON de.DRUG_SOURCE_VALUE = md.DRUG_SOURCE_VALUE
WHERE md.DRUG_SOURCE_VALUE IN (
SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_YEARS_OUTSIDE
)
AND md.DRUG_SOURCE_VALUE IN (
SELECT DRUG_SOURCE_VALUE FROM CTE_ENOUGH_DATA
)
GROUP BY md.DRUG_SOURCE_VALUE
ORDER BY md.DRUG_SOURCE_VALUE;
/*Once we find the NDCs with possible bad VALID_START/END_DATE, select better dates based on data*/
WITH CTE_DATA_YEARS AS (
SELECT p.DRUG_SOURCE_VALUE, p.CONCEPT_NAME,
CAST(('01/01/'+ CAST(MIN_YEAR AS char)) AS DATE) AS DATA_MIN_VALID_START_DATE,
CAST(('01/01/'+ CAST(MAX_YEAR AS char)) AS DATE) AS DATA_MAX_VALID_END_DATE
FROM NDC_PATCH p
)
SELECT p.DRUG_SOURCE_VALUE, p.CONCEPT_NAME,
CASE WHEN p.DATA_MIN_VALID_START_DATE < nv.valid_start_date THEN p.DATA_MIN_VALID_START_DATE ELSE nv.valid_start_date END AS NEW_VALID_START_DATE,
CASE WHEN p.DATA_MAX_VALID_END_DATE > nv.valid_END_date THEN p.DATA_MAX_VALID_END_DATE ELSE nv.valid_END_date END AS NEW_VALID_END_DATE
INTO NDC_PATCH_DATES_SELECTED
FROM CTE_DATA_YEARS p
JOIN Vocabulary.dbo.CONCEPT nv
ON nv.CONCEPT_CODE = p.DRUG_SOURCE_VALUE
/*Update the Vocab*/
UPDATE Vocabulary.dbo.CONCEPT
SET VALID_START_DATE = s.NEW_VALID_START_DATE, VALID_END_DATE = s.NEW_VALID_END_DATE
--SELECT DRUG_SOURCE_VALUE, VALID_START_DATE, NEW_VALID_START_DATE, VALID_END_DATE, NEW_VALID_END_DATE
FROM NDC_PATCH_DATES_SELECTED s
JOIN Vocabulary.dbo.CONCEPT nv
ON nv.concept_code = s.DRUG_SOURCE_VALUE
AND nv.VOCABULARY_ID = 'NDC'
--12437 rows were updated using this approach
Attached are the examples of what was updated. Updated VALID dates in Vocab NDCs.xlsx
@cgreich was anything like this incorporated into 20160817 or do I need to do these updates again on my side?
Haven't gotten to it, yet.
Writing a note for myself - these are some top NDCs that didn't get mapped, I might need to revisit my date pushing out logic to see if I was too strict:
NDC ROW_COUNT
00300304613 2080773
00054327099 1975690
53885024510 1270142
00045152550 1181879
00088221905 1169998
00046086781 1159599
00085128801 1148120
@pbr6cornell recently performed an analysis that let me know my date pushing out logic was not pushing out the drug dates in all cases. I have re-written my logic to leverage a large claims database to inform the INVALID_START_DATE/INVALID_END_DATEs out.
WITH CTE_DRUGS_WITH_EXPOSURES_OUTSIDE AS (
/*Use source and vocab to find NDCs of interest*/
SELECT DISTINCT de.DRUG_SOURCE_VALUE, c.VALID_START_DATE, c.VALID_END_DATE
FROM CDM.dbo.DRUG_EXPOSURE de
JOIN VOCABULARY_20170503.dbo.CONCEPT c
ON c.CONCEPT_CODE = de.DRUG_SOURCE_VALUE
AND c.VOCABULARY_ID = 'NDC'
AND (
de.DRUG_EXPOSURE_START_DATE < c.VALID_START_DATE
OR de.DRUG_EXPOSURE_START_DATE > c.VALID_END_DATE
)
),
cteDrugTarget AS (
/*Leveraging DRUG_ERA logic - https://gist.github.com/chrisknoll/a18c8e15ff66f26fac84 */
SELECT de.DRUG_SOURCE_VALUE AS DRUG_SOURCE_VALUE,
DATEFROMPARTS(YEAR(de.DRUG_EXPOSURE_START_DATE),1,1) AS DRUG_EXPOSURE_START_DATE,
DATEFROMPARTS(YEAR(de.DRUG_EXPOSURE_START_DATE),12,31) AS DRUG_EXPOSURE_END_DATE,
COUNT(*) AS ROW_COUNT
FROM CDM.dbo.DRUG_EXPOSURE de
JOIN VOCABULARY_20170503.dbo.CONCEPT c
ON c.CONCEPT_CODE = de.DRUG_SOURCE_VALUE
AND c.VOCABULARY_ID = 'NDC'
WHERE de.DRUG_SOURCE_VALUE IN (
SELECT DRUG_SOURCE_VALUE FROM CTE_DRUGS_WITH_EXPOSURES_OUTSIDE
)
GROUP BY de.DRUG_SOURCE_VALUE, YEAR(de.DRUG_EXPOSURE_START_DATE)
HAVING COUNT(*) > 100 /*Don't take years where exposure small*/
),
cteEndDates as (
/*DRUG_ERA LOGIC*/
select DRUG_SOURCE_VALUE, DATEADD(day,-1,EVENT_DATE) as END_DATE -- unpad the end date
FROM
(
select DRUG_SOURCE_VALUE, EVENT_DATE, EVENT_TYPE,
MAX(START_ORDINAL) OVER (PARTITION BY DRUG_SOURCE_VALUE ORDER BY EVENT_DATE, EVENT_TYPE ROWS UNBOUNDED PRECEDING) as START_ORDINAL, -- this pulls the current START down from the prior rows so that the NULLs from the END DATES will contain a value we can compare with
ROW_NUMBER() OVER (PARTITION BY DRUG_SOURCE_VALUE ORDER BY EVENT_DATE, EVENT_TYPE) AS OVERALL_ORD -- this re-numbers the inner UNION so all rows are numbered ordered by the event date
from (
-- select the start dates, assigning a row number to each
Select DRUG_SOURCE_VALUE, DRUG_EXPOSURE_START_DATE AS EVENT_DATE, -1 as EVENT_TYPE, ROW_NUMBER() OVER (PARTITION BY DRUG_SOURCE_VALUE ORDER BY DRUG_EXPOSURE_START_DATE) as START_ORDINAL
from cteDrugTarget
UNION ALL
-- pad the end dates by 1 to allow a grace period for overlapping ranges.
select DRUG_SOURCE_VALUE, DATEADD(day,1,DRUG_EXPOSURE_END_DATE), 1 as EVENT_TYPE, NULL
FROM cteDrugTarget
) RAWDATA
) E
WHERE (2 * E.START_ORDINAL) - E.OVERALL_ORD = 0
),
cteDrugExposureEnds as (
select d.DRUG_SOURCE_VALUE, d.DRUG_EXPOSURE_START_DATE,
MIN(e.END_DATE) as DRUG_ERA_END_DATE
FROM cteDrugTarget d
JOIN cteEndDates e
on d.DRUG_SOURCE_VALUE = e.DRUG_SOURCE_VALUE
and e.END_DATE >= d.DRUG_EXPOSURE_START_DATE
GROUP BY d.DRUG_SOURCE_VALUE,
d.DRUG_EXPOSURE_START_DATE
),
CTE_ERAS AS (
select DRUG_SOURCE_VALUE, min(DRUG_EXPOSURE_START_DATE) as DRUG_ERA_START_DATE, DRUG_ERA_END_DATE, COUNT(*) as DRUG_EXPOSURE_COUNT
from cteDrugExposureEnds d
GROUP BY DRUG_SOURCE_VALUE, DRUG_ERA_END_DATE
)
SELECT e.DRUG_SOURCE_VALUE,
CASE WHEN DRUG_ERA_START_DATE <= c.VALID_START_DATE THEN DRUG_ERA_START_DATE ELSE c.VALID_START_DATE END AS NEW_VALID_START_DATE,
CASE WHEN DRUG_ERA_END_DATE >= c.VALID_END_DATE THEN DRUG_ERA_END_DATE ELSE c.VALID_END_DATE END AS NEW_VALID_END_DATE,
c.VALID_START_DATE AS OLD_VALID_START_DATE,c.VALID_END_DATE AS OLD_VALID_END_DATE,
e.DRUG_ERA_START_DATE AS CALC_DRUG_ERA_START_DATE, e.DRUG_ERA_END_DATE AS CALC_DRUG_ERA_END_DATE
FROM CTE_ERAS e
JOIN VOCABULARY_20170503.dbo.CONCEPT c
ON c.CONCEPT_CODE = e.DRUG_SOURCE_VALUE
AND YEAR(c.VALID_START_DATE) BETWEEN YEAR(e.DRUG_ERA_START_DATE) AND YEAR(e.DRUG_ERA_END_DATE) /*if there are two eras at least take the one associated with what is in the Vocab*/
@ericaVoss, friends:
Before you paste even longer SQL code I will not find will power to reverse engineer let me ask the following question:
This whole debate resulted from us having suspicion of reuse of NDC codes, correct? Because if that weren't the case we could just ignore the start and end_dates. If no, explain. If yes, I believe in THEMIS we decided to start a list of reused NDCs and publish it. Olivier has one, we can add to it.
Help my memory please.
Yes, this started due to the reuse of NDC codes. I'm fine with flipping this and tracking the reused NDCs. What if in the Vocabulary we only set the valid dates for the reused ones and then for the non-reused ones we have really wide dates (1900-2099).
@ericaVoss:
Because we might want to study how long the pipeline keeps drugs going. And what the variation is, depending on what drug. And whether label changes make a dent into that.
@aostropolets: Can you reach out to Olivier and his list? We should put that somewhere into the vocab documentation, and a Forum post pointing to it.
@cgreich but I don't think the dates in the Vocabulary for NDCs are good anyway. They seem to be way tighter than they should be.
@ericaVoss: Why are you saying that?
@cgreich I was seeing a bunch that were outside of the dates, here are some examples from Vocabulary 20171201
NDC | DATA_START_DATE | VOCAB VALID_START_DATE |
---|---|---|
52544017572 | 2007-01-01 | 2012-12-01 |
49884059201 | 2001-01-01 | 2009-10-01 |
00039006011 | 2006-01-01 | 2007-06-01 |
What my code does it build eras of exposure of the drug, as long as there is >100 persons exposed year by year I allow the dates to move out.
@ericaVoss:
Which is brilliant. We should do a network study. The result should go to the NLM and Olivier. Like in "What the heck is going on with these? They are not part of your reuse list."
@cgreich - I know I'm dragging my feet there. Mui, Vojetech, and @dimshitc have been on me!
https://github.com/OHDSI/StudyProtocolSandbox/tree/master/NDCReview
@ericaVoss:
Perfect. Don't have to wait. We can get started and inundate Olivier. Who is on it? @dimshitc probably.
So far I had: @ericaVoss @dimshitc @aostropolets Vojtech @cgreich
@TinyRickC137 What if we start tracking this at least? here's the query so you can produce the list of concept changed their meaning each release. and track as long as possible in the past. Then the look up can be sent to the users. It might be better than changing overall vocabulary rules as I suggested.
Note, the query might return some false-positive results, so some manual review is needed. I compared the 2017 (the oldest we have) and Jan-2023 version, and got around 900 of concepts, while around 100 are a real reuse. also I can share this table, so someone can review it and we'll have a nice list of reused NDC at least from 2017
Looking NDC
00069551066-cetirizine hydrochloride 10 MG Oral Tablet [Zyrtec]
in the 20160311 VOCAB we have:Looking in raw claims data, it looks like this NDC was used earlier. Here is the row counts by year in one of our large claims datasets:
Addition, the net refers to this NDC earlier than 2010 - here is an example that calls this NDC as Zyrtec in 2005. http://www.mainecarepdl.org/sites/default/files/ghs-files/manufacturer-rebate/2005-05-19/mainemomanufacturerq105.pdf
I know we are getting these dates as feeds, but thought I would bring it up as an example just in case something was going on.