Automation of database extraction?

Ecaloota commented 4 years ago

Hi, Have you considered trying to automate data extraction from the orange book, using i.e. Selenium in Python?

yaelago commented 4 years ago

Hi @Ecaloota, we have discussed automated options, but did not feel it would easily provide all the information we are looking for. If you have an idea of how to automate the system in an efficient way, we are very open to hear and work together on this.

Ecaloota commented 4 years ago

Hi @yaelago. If the data you need could be condensed into a spreadsheet-type form (i.e. name, company, under patent, application date, number of generics, etc), which I think it can(?), I could give it a shot. I'm a bit busy with work, but I've done something like it before.

How many data points are you looking to get data from? If you could send me like 50 entries that you want data for (including combination therapies, and examples of drugs that aren't covered by the orange book, ideally), I could try over the next few weeks.

You should have my email already from the entries I did a few days ago on your Google form. We can talk there.

Ecaloota commented 4 years ago

Actually I've been looking around the FDA website and they publish plain-text data files which look to contain all the information you need (current as of March 2020). See https://www.fda.gov/drugs/drug-approvals-and-databases/orange-book-data-files.

If you look at the "Data Descriptions" header and you find that all the data you want is there (or can be derived from there), I can write you some code that will extract all the data into the form you need.

narath commented 4 years ago

hey @Ecaloota - I agree - I just downloaded the data files - here was the latest link and it looks like they are just ~ seperated files, which can easily be read by python or ruby, and it includes:

products.txt
patents.txt
exclusivity.txt

Which seems to include a lot of what you are interested in.

If you don't know how to process these, then perhaps we could arrange a quick 15 minute hackathon to see if Ecaloota or I can make quick progress on this, and if not, will at least be able to get a reasonable spec that you could do some promoting with engineers without borders or other groups.

Best, Narath

yaelago commented 4 years ago

Thanks @Ecaloota and @narath for your suggestions and investment in this project. I confess I don't have the experience in processing such files, but if as you suggest it is a simple coding task that will save many of us much time, that sound amazing! It could also expedite the next phases of E$$ENTIAL MEDICINDE$ - looking into medicine prices (more on that on a separate issue coming soon) and opening up the review to other countries around the world.

borawl commented 4 years ago

I was wondering same thing. Happy to help out creating a upload file.

borawl commented 4 years ago

@yaelago where are you storing the data so far? What format?

yaelago commented 4 years ago

Hi @borawl, the data is currently stored in excel files, but the plan is to store it in a database format. We are currently working on the design requirements for such a database. Suggestion always welcome :)

borawl commented 4 years ago

What type of database?

yaelago commented 4 years ago

Not decided yet. An RDMS probably as it seems the best way to store and access the data. But we want to keep the data open access, so still need to investigate into that.

borawl commented 4 years ago

Sql Server may be an option. Yes RDBMS. What criteria did you use to come up with your list of drugs on the first form?

borawl commented 4 years ago

I looked at the files @narath pulled. My suggestion would be to start with a smaller dataset to figure out the nuts and bolts of doing the extraction and eventually the upload into a database. Maybe pick one or two drugs to focus on. The largest amount of time will be spent on cleaning up data. If we are lucky, the FDA has taken care of the scrubbing. Thoughts?

yaelago commented 4 years ago

We are working with the world health organization (WHO) list of essential medicines for infectious diseases. Defined as medicines which "satisfy the priority health care needs of the population" and to which people should have access to, at all times, in sufficient amounts. https://www.who.int/medicines/publications/essentialmedicines/en/

Ecaloota commented 4 years ago

Hi @yaelago, @borawl, and @narath. I've thrown together a quick Python script which I think is working (i.e. it looks to be pulling out the info on the Google Form drugs and combination drugs from the three FDA files), but would benefit a lot from other, more experienced, sets of eyes.

At the moment it just outputs pretty rough data to a CSV, and I haven't written any tests or done much error-checking, so if someone wants to give that a go, they can, otherwise it might just be helpful to the project in combination with what E$$ENTIAL MEDICINE$ is already doing. I'll probably chuck it up as a repo, so you can fork it from there etc. if you want.

And let me know if you want the output form changed or more data output... I mainly did it this way because it was quick.

Archive.zip

yaelago commented 4 years ago

Thanks @Ecaloota, this is really helpful! There are a few additional point of information we are interested to include in the database. For example which of the applicants is a "new drug product" and which medicine is protected by a patent or an exclusivity agreement. Could we think together what the best output would look like?

alintheopen commented 4 years ago

Hi @Ecaloota @narath and @borawl - thanks so much for your contributions and suggestions! So great to see you all here and we really welcome you to the project discussions. The database output will be key so any and all advice here is very welcome. Similarly, I'm super interested by the data extraction. We think that it might be trickier to do this for some markets compared to others, but this would of course hugely expedite the process and provide a complimentary method that would help us to check accuracy of data. Would you join us for a community meeting to discuss this a little further? Warmest wishes, Alice

Ecaloota commented 4 years ago

Hi @yaelago, that’s no problem at all. I actually already collect that information in the script but probably haven’t output it explicitly; I’ll spend a few hours today writing some tests to make sure I’m not missing any drugs (at least from the list that I have) and we can discuss output format at some point. @alintheopen I’m happy to attend some sort of meeting if you’re organising one. I work remote so can make basically any day/time

alintheopen commented 4 years ago

Thanks @Ecaloota - if we could a rough idea of timezones that people are living/working in, that would help us to set up a meeting. Cheers ALice

Ecaloota commented 4 years ago

Hi @alintheopen. Fair enough; I'm in Canberra, AEST

borawl commented 4 years ago

Hi everyone! EST, Virginia, USA

Propargyl commented 4 years ago

Re next stage, it seems like there are a lot of accessible databases with various types of pricing information (eg. http://www.haiweb.org/medicineprices/national-medicine-prices-sources.php). In theory you could scrape the information/use APIs if available. You could automate/schedule, and track prices over time, statistical analysis, etc.

There are also some websites that have compiled historical pricing data/allow analysis - some of these are commercial, which suggests implementation of the above might not be trivial.

Ecaloota commented 4 years ago

Hi all. I've made some important fixes to the code I put up the other day and I'm much more confident that it's working correctly now. All the drugs/combination drugs I checked manually are being found correctly, and their generic forms and application dates etc. found also.

If you want to run it with more drugs than I've done (I've only tested on 18 drugs/combination drugs), all you would need to do is change which drugs you want data on in the "essential_drugs_list", which is near the bottom of the code, then run the script. I'm happy to fix any bugs that are found, and change the output format when I get some more info on that. I've also updated the output to include the Application_Type ("N" or "A") per @yaelago's request.

I've also made some other minor bug fixes and implemented a "bad_words" list, which excludes drugs that contain the name of an essential_drug within it, but is itself not an essential_drug. For example, 'valacyclovir' contains 'acyclovir' which is in the essential_drug_list, but 'valacyclovir' is not itself in the current essential_drug_list, so valacyclovir should and does get excluded from the final output.

orange_book_parser.py.zip

yaelago commented 4 years ago

Wow @Ecaloota , you really put a lot of effort into this. thanks so much! I am sorry to be a pain, but since I can't open the code you sent on my computer, can you send an example for what the data extracted looks like? I was thinking we should test this code on one of the complex medicines (e.g. azithromycin or ampicillin) with lots of new and generic applicants and then compare it to the data we have been getting form the form.

Ecaloota commented 4 years ago

Hi @yaelago. No worries, it's fun. Here's the output I have. It should already have data on ampicillin and azithromycin.

output.csv.zip

In fact, here's one with only ampicillin:

output_ampicillin.csv.zip

and here's one with only azithromycin:

output_azithromycin.csv.zip

When I open them in Excel, I get weird text-wrapping in the Applicant_Full_Name column such that some of the entries look empty - you can turn off text-wrapping for those cells and then you should be able to see them. I can probably fix that somehow later.

To run the code (if you want to), you only need Python 3, which should be installed by default on Mac, or can be installed easily enough on Windows. If you decide to install it and can't run the script, let me know because I could have introduced some weird iOS-specific system-dependencies. You'll also need the text files from the FDA (link is in this issue thread).

Reading the code nicely can be done with (for example) Sublime Text, Visual Studio Code, Atom, IDLE (python default), or even in iTerm or the terminal. Sorry if you already know this, just thought it better to be thorough.

Ecaloota commented 4 years ago

Oh, and if the Under_Patent, Latest_Patent_Date, Exclusivity_Agreement, and Latest_Exclusivity_Date columns are empty for a particular drug, that means there is no patent or exclusivity agreement for that drug (that I found). For example, in the azithromycin output file, it looks like there's only one form that's under patent, and that's under patent until 2024.

Ecaloota commented 4 years ago

Actually I think I've found a bug in how generics are being found. For example compare Line 15 of the azithromycin file with lines 4-12. I think lines 4-12 are generic forms of line 15? I will fix this tonight hopefully.

Ecaloota commented 4 years ago

Ok, fixed.

UpdatedFiles.zip

Those generic drugs with an application code of "A" that remain (for example, in the ampicillin file) don't have corresponding new ("N") forms. They look like they should, but if you look carefully, I think that it's correct per your requirements. For example, in the ampicillin document, lots of "A" drugs have the same dosage form, route of administration, active ingredient, and strength, but there is no "N" drug that matches each of these exactly.

alintheopen commented 4 years ago

Wow, this is sensational. Can't wait to have a closer look tomorrow and we'll also try to arrange a time for a community meeting too.

yaelago commented 4 years ago

Ok, fixed.

UpdatedFiles.zip

Those generic drugs with an application code of "A" that remain (for example, in the ampicillin file) don't have corresponding new ("N") forms. They look like they should, but if you look carefully, I think that it's correct per your requirements. For example, in the ampicillin document, lots of "A" drugs have the same dosage form, route of administration, active ingredient, and strength, but there is no "N" drug that matches each of these exactly.

@kym834 could you comment on "A" medicines not having a corresponding "N" medicine?

yaelago commented 4 years ago

@Ecaloota , these outputs are awesome! I've started comparing them with the data we received from participants, and its looking really good!!! For sake of easier comparison, would it be possible to have the output in the same order as the data we have currently collected (Applicant Holders, Market Status, Active Ingredient, Propriety Name, Dosage, Route , Strength, FDA approval, Patented, Expiry date, Exclusivity, Expiry date Generics). Also, for some reason the Dosage and Route come out in one column, is there a way to spit them into two separate ones? Once we confirm the data in this format matches the data collected manually, we can rapidly move on to the next phase of the project! that is really exciting! Thanks for your amazing help!

kym834 commented 4 years ago

Hi @yaelago and everyone!

I had a quick look at the ampicillin file. The data is definitely coming out as per our requirements. This is great stuff @Ecaloota! I know that there are some other medicines which also have 'A''s that do not have a matching 'N'. Ampicillin is interesting though, a few things

I can see that rows 1, 2, 8, 9, 10 and 11 all have the trade name omnipen. This is odd as normally, if they are an 'A' they don't have a registered trade name only 'N's do.
If they really are 'A' I think they actually have a corresponding 'N' just that the strength has been input in a different manner. These have it as a straight mg and the N's is as EQ (equivalents) of base. Ampicillin itself actually a base but so I would think that these are actually the same thing.
Rows 19-28 are legitimate A's without and N as are rows 41 and 42.

I don't think the code would be able to be adjusted to solve any of this though @Ecaloota??

I think we may just need to do a quick check of each of the data sets. Something we will be doing anyhow when we move it into the database.

Ecaloota commented 4 years ago

Hi @yaelago, I will change the output format into the order you've requested. Should be done tomorrow after I finish work. @kym834, I can modify the code such that the ampicillin rows 1,2,8,9,10, and 11 are detected correctly, but I imagine more generally this would need to be handled on a drug-by-drug and strength-by-strength basis unless "EQ 200MG BASE" is always going to be equivalent to "200MG" for all drugs in all situations, for example, in which case handling equivalents with different strengths will be trivial. Either way, I can fix it, but one way is a lot neater than the other.

yaelago commented 4 years ago

@kym834 , I'm wondering whether this would be a problem with the manual data extraction as well. Since the data is mined by "N" medicines, those who don't have a corresponding "N" may be overlooked.

kym834 commented 4 years ago

@Ecaloota, yes, it would have to be handled on a drug by drug basis. Ampicillin is an exception. The EQ 200mg base is not always going to be equivalent to 200mg. I'm wondering if it would it be possible to have two different codes? We could run the one that you have developed as that will work in the majority of cases. Then, in our check if we see there are any remaining 'A's that, as with ampicillin, we think actually do have an 'N' we could run it through the second code that accounts for the differences in the strength input as you have suggested. I don't know how often this will occur though and therefore will it be worth our time...

@yaelago, I think it is. In the development stage when I was looking at the data I only found one case where there was one extra 'A' without a 'N', for amikacin. I noted it down so we could add it in but didn't think it would be too much of an issue with other medicines. Ampicillin has proven me wrong on that front! If we are going to continue with the manual collection (??) we should add a section into the form to capture any remaining 'A's. Or even just a yes/no and then we could use the code to pull them out and add them in.

Ecaloota commented 4 years ago

Dear all, I've attached the output in the format requested by @yaelago several days ago, and the most recent version of the code. Sorry about the delay, work's been hectic. I've also resolved the equivalence issue for the special case of ampicillin per @kym834's comments.

Please feel free to let me know if you need anything else.

Archive.zip

p.s. If you need the data on drugs other than those I got from the initial look at the Google form, I can run the script with those new drugs as well (if you need).

yaelago commented 4 years ago

Is is great @Ecaloota! thanks so much! The Google form has been updated with time with new medicines to review. The full list of the medicines we are working with is here- WHO essential anti-infective medicines.xlsx It would be great if you could run the code on the whole list!

I do see the output needs to be cleaned out a bit before exported to an external database, but this is something we would need to do manually, using our understanding of active ingredients and combination medicines.

You might have seen we now closed the form for additional contributions #5 , and are planing the next stages to review. We may have some additional information to extract from other sources, in which automation could assist, so we will keep you @Ecaloota @borawl @narath in the loop 🥇

Ecaloota commented 4 years ago

Hi @yaelago, I've attached the output for all the drugs you listed in your spreadsheet. Note that I made a few changes, but other than that didn't sanitise the input (didn't spell-check, check for duplicates, etc):

Changed 'aciclovir' to 'acyclovir' (the former wasn't in the FDA files)
Made 'sodium stibogluconate or meglumine antimoniate' into two entries (sodium stibogluconate; meglumine antimoniate)

The final file is large (~1000 lines) and from a quick glance there looks to be quite a few issues with picking up the generic forms (like we had with ampicillin). I think overall you're still ahead, because lots of drugs have generic forms, and it's still picking up patent and exclusivity info, but let me know if there's anything annoying you want me to fix, particularly if it saves you a lot of manual time.

EOBZIP_2020_04.zip

kym834 commented 4 years ago

@Ecaloota, thanks so much! I see what you mean about the generic issues. No matter, this has definitely speed up the process and given us all the other information we were looking for. We will have a look through it and do a clean up. If we come across any easy fix issues will let you know. Thanks again!

narath commented 4 years ago

Wow team, so cool to see all the progress (esp for the cool scripts from @Ecaloota), sorry to have dropped out (have ongoing clinical responsibilities here in Boston MA). Excited to see ongoing progress.

kym834 commented 4 years ago

Hi All :wave: @yaelago and I have been working on cleaning up the automated output and have come across an issue. Somewhere along the line we lost the approval dates for the N's. It looks like it was when we changed the order of the output to match that of what we had in the manual data collection.

@Ecaloota (or anyone who is able to help out) is it possible to add the approval date back into the code and run it again? It is pulling the approval date out for those that it picks up as generics so it is only for those that the code doesn't recognise as a generic that needs to be updated.

Additionally, we've been thinking it might be a good idea to also pull out the application number to use as an identifier for each entry. We think this will be helpful later on especially if we extract data on the medicines for other databases. We would then be easily able to match it to what we already have in our public database. Thoughts on this?

Ecaloota commented 4 years ago

Hi @kym834. I've made the requested changes to include an application number and product number (together they make a unique identifier for a particular drug) and made it so both are printed for each drug, including generic drugs. The format for the application number is XXXXXX (six digits), and the format for the product number is XXX (three digits).

EOBZIP_2020_05.zip

If you open the attached .csv document in Excel, you will need to use a custom "Cell Format" to correctly display the application number and the product number, because excel is very insistent about removing leading zeroes (for example, if the application number is 000123, it will appear as 123). I had to learn how to do this so I could verify it was being done correctly; the link I used is: https://www.ablebits.com/office-addins-blog/2017/01/18/add-leading-zeros-excel/

I've also added the approval date back into the document for each of the N drugs.

Let me know if you need anything else.

kym834 commented 4 years ago

Cheers @Ecaloota, it looks good. Also thanks for the tip on leading zeros! Worked perfectly.

kym834 commented 4 years ago

Hi Everyone :wave:

We are looking at expanding Stage 1 of E$$ENTIAL MEDICINE$ to Australia. We would like to hold a community meeting to discuss automating the collection of data and to find out if this is at all possible :smile: We would love to have your input on this and have started a new issue (https://github.com/TheBreakingGoodProject/Essential-Medicines/issues/7) to discuss potential date and times taking into consideration that we live all around the world.

TheBreakingGoodProject / Essential-Medicines

Automation of database extraction? #2