coderxio / medication-diversification

More realistic synthetic medication data.
Other
12 stars 3 forks source link

Figure out an automated way to find first available date of a medication product #30

Open jrlegrand opened 3 years ago

jrlegrand commented 3 years ago

Problem Statement

If we don't put any date limits on our modules, patients in 1901 are going to be getting modern medications.

Criteria for Success

Some automated way to query FDA to link first available date of drug application to a product RXCUI or NDC.

Additional Information

See example here: https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm?event=overview.process&ApplNo=075743 Original approval date = 07/01/2002

A lot of modules list this in the remarks: https://github.com/synthetichealth/synthea/blob/master/src/main/resources/modules/medications/moderate_opioid_pain_reliever.json

    "Availability dates for different drugs came from the FDA: ",
    "http://www.accessdata.fda.gov/scripts/cder/daf/",
    "Dosage information came from FDA reference sheets available at: ",
    "https://www.drugs.com/dosage/",
    "The following medications are prescribed in this module. Most opioids were invented ",
    "during or prior to WWII, but did not become widely available until after the war:",
    "1. Meperidine (Demerol)                      available after 1945      ",
    "2. Hydrocodone/Acetaminophen (Vicondin)      available after 1943      ",
    "3. Oxycodone/Acetaminophen (Percocet)        available after 1920 (est)"
kristentaytok commented 3 years ago

Could the "Active Start Date" in RxNav be an easier (and possibly more accurate) substitution here? I say possibly more accurate because some drugs take longer than others to go from FDA approval to market entry.

https://mor.nlm.nih.gov/RxNav/search?searchBy=String&searchTerm=airduo --> click on "Status", Active Date = 6/1/17

https://www.asthmaeducators.org/resources/Pictures/AirDuo%20Launch%20Press%20Release.pdf - market launch was ~April 2017? https://www.accessdata.fda.gov/drugsatfda_docs/nda/2017/208799Orig1s000TOC.cfm - FDA approval was Jan 2017

jrlegrand commented 3 years ago

See this Synthea GitHub issue conversation: https://github.com/synthetichealth/synthea/issues/236

jrlegrand commented 3 years ago

@kristentaytok - do you know where that "Active Start Date" metadata is stored within RxNorm?

jrlegrand commented 3 years ago

This looked promising. See the data structure diagram under the accordion dropdown: https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files

kristentaytok commented 3 years ago

Not sure if this helps; I haven't worked much with RxNorm/Nav/Class, but happy to look further into it.

https://rxnav.nlm.nih.gov/api-RxNorm.getRxcuiHistoryStatus.html

jrlegrand commented 3 years ago

@kristentaytok @Bridg109 @yevgenybulochnik - https://rxnav.nlm.nih.gov/REST/rxcui/1918195/historystatus.json (!!!)

jrlegrand commented 3 years ago

@kristentaytok - you beat me to it!

jrlegrand commented 3 years ago

Tylenol says it was active start date of 2005 ... so I don't think this field means what we think it means...

kristentaytok commented 3 years ago

Sad

Hm, FDA only has active NDCs from what I can tell--but it gets updated daily: https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory?elqTrackId=6ecc57b6b7ec4fa7bdcb28cd7a59da1b&elq=795c4bbff787408d9131556bb0434617&elqaid=4397&elqat=1&elqCampaignId=3471

the NDC database files that come from the above link are broken down to product (first 9 digits of NDC) and package (11 digit NDCs). the product file has more details like drug names. but "startmarketingdate" goes back to 1900

jrlegrand commented 3 years ago

@hodgesr2 - can you start looking into what is possible for this?

Basically, I would want to know the very earliest date (or even just a year) a medication product was available in a computer understandable and easily updatable format. Ideally, this would be for more than just active NDCs.

@kristentaytok has started doing some analysis so please reach out to her if you have questions about what she's commented above.

hodgesr2 commented 3 years ago

yeah I'll take a look at it and see what I can find.

hodgesr2 commented 3 years ago

Also sent a email to the FDA to see if they have some info on any data that may help with is.

hodgesr2 commented 3 years ago

I found this... but can't seem to find a CSV with all the information. They are all separate CSVs for each drug entity. But it has an action date which I think translates to approval of the medication on the market.

https://www.accessdata.fda.gov/scripts/cder/daf/

hodgesr2 commented 3 years ago

Orange book has some decent data I think.

image

Doesn't have NDC, but has ingredient and approval date. However, the approval dates don't go back beyond 1982

LINK to data: https://www.fda.gov/drugs/drug-approvals-and-databases/approved-drug-products-therapeutic-equivalence-evaluations-orange-book

Orange Book Data files compressed

hodgesr2 commented 3 years ago

Think I found what may work.

NDC database file. has first two sections of NDC and Start Market Date. Looks accurate from a spot check I did on some drugs.

image

LINK To data: https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory

Download NDC database file

hodgesr2 commented 3 years ago

Let me know if that is what you were looking for

jrlegrand commented 3 years ago

I think @kristentaytok mentioned this above but noted that it's only active NDCs... but maybe this is as close as we are going to get.

In my lisinopril example way up at the top, I do see that the earliest start marketing date is represented as 07/01/2002 in the FDA spreadsheet.

We need join a product-level RXCUI (SCD/GPCK/SBD/BPC) to the NDCPACKAGECODE in the package.xls file and then join the product.xls file on PRODUCTNDC and select the earliest date.

jrlegrand commented 3 years ago

@hodgesr2 - you can use this file as an intermediary table to do some testing: https://drive.google.com/file/d/1Nc-Ri2EpDQzebiPZsjXAvBDtodw_BGLc/view?usp=sharing

Let me know if you have questions. We should end up with a file with three columns: MEDICATION_PRODUCT_RXCUI, MEDICATION_PRODUCT_TTY, and EARLIEST_START_MARKETING_DATE.

jrlegrand commented 3 years ago

@kristentaytok - the other question is how we actually implement this. I'm guessing it would be crazy complicated to add this to the CSV distributions... so my other thought is that we do a check AFTER the patient gets randomly assigned a product - check if the year is greater than X, if so - continue on - if not, go back to the ingredient distribution check. We would also want an initial date check to see if we are at least past the earliest date that at least ONE of the meds were available - otherwise, we may get stuck in an infinite loop and break Synthea lol.

jrlegrand commented 3 years ago

OR I guess how hard would it be to add to the CSV (if we can have Date as an attribute)... like if it was before a date a product was available then the distribution would always be 0 for that product. This is also wonky b/c we are trying to apply 2018 distributions to all the years prior to 2018 - back to 1901...

kristentaytok commented 3 years ago

I think the easiest implementation on the MEPS-CSV creation front is to create distributions for each calendar year. So basically any year after the earliest_start_marketing_date to present is given a row, and similar to products that aren't being used in a certain state/age group/gender group, those that have NULLs in a given year will be filled with a zero. So we're just adding a year column to the CSV files that gets aggregated to calculate a unique distribution for each year.

Not sure about how to change the Synthea JSON part of the implementation though. But would this approach for the CSVs help?

kristentaytok commented 3 years ago

I think @kristentaytok mentioned this above but noted that it's only active NDCs... but maybe this is as close as we are going to get.

In my lisinopril example way up at the top, I do see that the earliest start marketing date is represented as 07/01/2002 in the FDA spreadsheet.

We need join a product-level RXCUI (SCD/GPCK/SBD/BPC) to the NDCPACKAGECODE in the package.xls file and then join the product.xls file on PRODUCTNDC and select the earliest date.

I might be missing something here, but i was thinking we just need to use the "PRODUCTNDC" file and join on the first 9 digits of the NDC from the RXCUI_NDC table with the PRODUCTNDC (no need to use the PACKAGE file/NDCPACKAGECODE).

jrlegrand commented 3 years ago

@kristentaytok yeah I think you are right that we wouldn't need the product file - just the package file. And I think the CSV idea you had is what we should do for the implementation if possible. Maybe I'll look into seeing if it's possible to use date in a table transition.

jrlegrand commented 3 years ago

One thought though is that if we want to get to the month or even day level with start dates for med products, that distribution strategy would get even more unwieldy. But year is how I've seen other modules do it so I think it's ok.

kristentaytok commented 3 years ago

Yea I was thinking about the month/day stuff as well, but agree it would get way too big and not worth the extra detail.

It'll be pretty easy to implement the column in the CSV, but my question is for the MDT do we want to download and store daily versions of this file in our db or just connect to the site/link and have it run on the day the user runs the MDT? The disadvantage to the latter is the user may not get the same results if drugs/codes get removed from the file between runs.

jrlegrand commented 3 years ago

For MDT database loading, I think we would treat the FDA NDC file(s) like we do the RxNorm files. We run them once when the user initializes the MDT app, and then give the user the option to periodically refresh all the database tables, or pick specific ones to update (i.e. RxNorm or FDA or MEPS). I'm mostly concerned about not prescribing modern drugs in really ancient years - not about missing cutting edge therapies... unless I'm misunderstanding your concern.

For MDT JSON, we also will probably still want to know the earliest possible year from all the products in the submodule so we can bypass all the logic if the year is earlier than that because we know the patient will not get a match.

hodgesr2 commented 3 years ago

I liked the idea of doing a check. To prevent an infinite loop, could we create a random number generator for the year date between a certain range? Like, if check is false, then go to random generated number for the year within a range and recheck? I wonder how much of an extra computational load it would put on it to create the dataset like this.

I'm going to be in Aruba rest of this week and half of next week. I will work on that intermediary table over the weekend probably in between the beach and eating @ brazilian steakhouse and german beer.

kristentaytok commented 3 years ago

@jrlegrand - makes sense given our main goal is to prevent inappropriate med orders from earlier years. My concern was more around reproducibility (and is probably more of an edge case than a major issue): if we ran the NDC file today and the earliest_start_marketing_date for drug X was 1950, for example, and the NDC became obsolete tomorrow and fell off the list such that the next earliest_start_marketing_date for drug X became 1990, then the user running the MDT tomorrow or later will get very different results for drug X vs. what we get today. Hope this makes it more clear?

But again, since the priority is more to address the bigger issue that we're properly prioritizing (preventing med orders from getting created before a drug was available), then this smaller issue may just be more of a caveat that the end-user can consider addressing by downloading and saving the daily FDA files over time and taking the earliest_start_marketing_date across the files.

jrlegrand commented 3 years ago

@kristentaytok - I hear you now. Yeah I feel like this might be more of an edge case but not really sure. Still probably the best data source we have for now.

hodgesr2 commented 3 years ago

@jrlegrand do you want the intermediate file to have a date data type for market date as well? Currently it is in integer format with year first.

jrlegrand commented 3 years ago

Yea I think so

jrlegrand commented 3 years ago

At very least we just need to pull year out of that integer

hodgesr2 commented 3 years ago

Ok. Cleaned_CSV.csv is loaded into google drive. Let me know if that is what you want. The NDCs may load without the leading zeros in excel. It's annoying. May have to format the column and do a special type of format of "000000000" to get the leading zeroes on. There are 3 date columns: the original, the year, and a date format column.

hodgesr2 commented 3 years ago

Link. Forgot to add

https://drive.google.com/file/d/1XoGxvs2bC0GhVD8tlAuO1PjpT9o95wdL/view?usp=sharing

hodgesr2 commented 3 years ago

There are also NULLs for the TTY and RxCUI columns. I didn't know if you wanted those removed or not but I left them in there in case you needed to validate anything.

Bridg109 commented 3 years ago

What software did you use to put this together?

On Mon, May 10, 2021, 5:10 PM Robert Hodges @.***> wrote:

There are also NULLs for the TTY and RxCUI columns. I didn't know if you wanted those removed or not but I left them in there in case you needed to validate anything.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/coderxio/medication-diversification/issues/30#issuecomment-837418246, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJUPMCQQ4GJDRLE7VBVT3ZTTNBKT3ANCNFSM424AJBTA .

hodgesr2 commented 3 years ago

Excel and Python (Anaconda).

jrlegrand commented 3 years ago

@hodgesr2 - we actually need NDCPACKAGECODE (the PACKAGE NDC) instead of the PRODUCTNDC unless we have a really reliable way of getting from the MEPS package NDCs to product NDCs. Also, could you please pull in the medication_product_name field for human readability? Did you try to join to the RxNorm rows based on PACKAGE or PRODUCT NDC?

hodgesr2 commented 3 years ago

k. I'll take a look at it again Friday when I get back. I joined on PRODUCTNDC. I'll try and redo it with NDCPACKAGECODE

Bridg109 commented 3 years ago

And it would be ideal if solely done in python that way we can code the process into automatic process

On Mon, May 10, 2021, 7:52 PM Robert Hodges @.***> wrote:

k. I'll take a look at it again Friday when I get back. I joined on PRODUCTNDC. I'll try and redo it with NDCPACKAGECODE

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/coderxio/medication-diversification/issues/30#issuecomment-837570962, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJUPMCRRTXFW2EKHVYGWKWTTNB5TVANCNFSM424AJBTA .

hodgesr2 commented 3 years ago

do you want it in a ipynb file that jupyter uses (my preference) or prefer a py file?

Bridg109 commented 3 years ago

Py file would be best but we can convert it from jupyter pretty easy if you want to do it that way

On Mon, May 10, 2021, 8:15 PM Robert Hodges @.***> wrote:

do you want it in a ipynb file that jupyter uses (my preference) or prefer a py file?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/coderxio/medication-diversification/issues/30#issuecomment-837590825, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJUPMCUFQHOZE6ZC32WCV4TTNCAJZANCNFSM424AJBTA .

jrlegrand commented 3 years ago

@hodgesr2 - we should also pull in ENDMARKETINGDATE

jrlegrand commented 3 years ago

Reopening because we never really finished this feature.

jrlegrand commented 2 years ago

Synthea has an attribute called time (I think) that can be used in a transition table. It's the number of seconds since some year.