Automated Extraction of Stage II Data, Australia

Ecaloota commented 4 years ago

Hi all,

Following our meeting on the 13th July, I would like to update you all on my progress with writing a script to automate data extraction from the ARTG, which has been moderately successful.

I used the database at the EBS to extract a list of all the potentially-relevant ARTG IDs (I didn't use this one, because it doesn't allow logical operators, but the two databases say they are linked to the same source) by searching for all drugs that were of interest in Stage I (I added "aciclovir" myself as it appears to be the more-common spelling of "acyclovir"). This had to be done in seven steps due to limitations with the Aus Gov servers, so duplicate ARTG IDs introduced at this stage were removed manually.
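The de-duplication across the seven search batches could also be scripted. A minimal sketch, assuming each batch has been saved as a list of ID strings; the function name `merge_unique_ids` and the sample values are hypothetical and not part of the original script:

```python
def merge_unique_ids(batches):
    """Merge several lists of ARTG IDs, dropping duplicates and blanks
    while keeping the order in which each ID was first seen."""
    seen = set()
    merged = []
    for batch in batches:
        for raw_id in batch:
            artg_id = raw_id.strip()  # tolerate stray whitespace from copy/paste
            if artg_id and artg_id not in seen:
                seen.add(artg_id)
                merged.append(artg_id)
    return merged


# Illustrative input only: three overlapping batches of search results.
example_batches = [["296381", "169622"], ["169622 ", "100001"], ["296381", ""]]
unique_ids = merge_unique_ids(example_batches)
```

Keeping first-seen order (rather than sorting a set) makes it easier to compare the merged list against the individual search exports.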

I then wrote a bot to search for all the drugs by ARTG ID on the latter database. Most of the available information is on this webpage, so I extracted the product names, strengths, IDs, dosage forms, routes of administration, and sponsor. If the entry had an associated Product Information PDF file (apparently only 90% have one, but I think it's probably less than that), I downloaded that automatically and converted it to a plain-text file. At this stage, it became clear that 530 of the entries I'd collected were of random multivitamins that were included due to ambiguities in the search (my fault), so I removed those manually in Excel. The thus-cleaned list of ARTG IDs is here.

Looking through the remaining entries, I also found some combination drugs that were composed entirely of drugs that were individually in our search, but that were not listed in that particular combination, for example "abacavir + lamivudine + zidovudine". I kept those in the cleaned list because I considered that they could also be important. The list of ARTG IDs for these ambiguous combination drugs is here.
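One way such combinations could be flagged automatically is sketched below. It assumes the ingredients are available as a "+"-separated string, and that the Stage I drugs and known combinations are held in sets; the function name, the labels, and the example sets are all hypothetical:

```python
import re


def classify_combination(ingredients_text, target_drugs, known_combinations):
    """Label an ingredient string as 'single', 'known', 'ambiguous' or 'other'.

    target_drugs: set of lowercase drug names from the search list.
    known_combinations: set of frozensets of lowercase drug names.
    """
    parts = [p.strip().lower() for p in re.split(r"\s*\+\s*", ingredients_text)]
    components = frozenset(p for p in parts if p)
    if not components:
        return "other"
    if len(components) == 1:
        return "single" if components <= target_drugs else "other"
    if components in known_combinations:
        return "known"
    # Every component is individually of interest, but the combination is not listed.
    if components <= target_drugs:
        return "ambiguous"
    return "other"


# Illustrative sets only; the real lists would come from the Stage I data.
targets = {"abacavir", "lamivudine", "zidovudine"}
known = {frozenset({"abacavir", "lamivudine"})}
label = classify_combination("abacavir + lamivudine + zidovudine", targets, known)
```

Entries labelled "ambiguous" correspond to the cases described above and could be written to their own list for review.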

Of the data that remained to be extracted (under patent/patent date, exclusivity/exclusivity date, market status, generic forms), I don't think the ARTG stores any of it, which makes it impossible to determine whether one drug is a generic form of another on the basis of the criteria we used for Stage I. Also, I believe that collecting information on approval (y/n) might be redundant in this case, as only the approved drugs seem to appear on the database? Happy to add this to the data if I'm wrong, of course.

That only leaves approval dates, which are the only piece of data in the PDFs that can't be extracted from the web-server. Because of very inconsistent formatting of both the PDF files and the product name conventions, I did both manually for the remaining ~1500 entries based on what I had already collected into the spreadsheet. The output of my script following manual cleanup is here (and prior to manual cleanup is here).

The Python script and bot itself is here. I would not recommend running it if you have limited data from your internet provider because it will try to automatically download thousands of PDF files, and has many dependencies. I can also send you all the downloaded PDFs, but there are too many to be uploaded here.

As always, it would be great to have everything double-checked, and I can make adjustments if required in about a week and a half.

@yaelago @kym834 @alintheopen @fantasy121

kym834 commented 4 years ago

Hi @Ecaloota,

Looks like it has been very successful, not moderately! Comments/thoughts below, following the order of what you have written above.

All good on EBS, like you said same source.
British vs. American English at play with aciclovir/acyclovir, I think. Something we need to keep an eye on as we expand to other countries and to other medicines as well.
Is there a way to remove the random multivitamins without having to do it manually? For some context, the list we have been working with only includes the anti-infective medicines that are on the WHO essential medicines list. In future, we want to expand to the whole WHO list, which is much bigger, so this might be quite time consuming.
The other combination medicines may be in other sections of the WHO's list, so they may be important in that regard, especially in later stages when we expand to the whole list.
Yup, we didn't find any of that information on the ARTG website either. We will have to see if we can find any of this information elsewhere. It may also be that the patent information we collected for the USA is relevant for Australia if the patent is worldwide; again, something we will have to look into further. I have a vague memory of a status field, but I think it was the register status and it was in the Product Information PDFs. I went back and had a look through a few PDFs for different medicines and couldn't find it. It may be that this is not consistent across all the information documents, as with the approval dates.
We'll have a think about how we might be able to identify generics. Maybe this information is held somewhere else, or we just need new criteria.
Approval is somewhat redundant. With automation, if we don't get info for a medicine, it is highly likely that it is not approved. We had that field in when we were manually searching for the information and had to check each one individually. It would be useful for a separate data set that could be used to compare which medicines are approved and which aren't in different countries, but I think this would be a separate Excel file to this one.
You did them all manually! I hope that didn't take too long.
Output files look good :)

I'll have a deeper look at the output file and mull over some of the things that have been brought up over the weekend so that we can discuss more in the meeting next Monday (#9)

Thanks so much for doing this work and the update!

kym834 commented 3 years ago

Hi @Ecaloota,

Hope you are doing well! I was wondering if you had some time in the next couple of weeks to meet and chat (over Zoom) about some of the finer details of the data extraction for the anti-infection essential medicines, and about using the code to extract data for the other essential medicines once we have the process all refined?

Maybe we could do it workshop style, where you could show us how it all works as well. It would be good for us to understand how it was conducted so that, when we are asked about this while presenting and sharing info about the project, we can explain it to those who are interested. I am also just very interested in this personally :)

Would be great if we could set up a date and time and then anyone who is interested and can make it can also join us.

Ecaloota commented 3 years ago

Hi @kym834,

I'm going well. Likewise, I hope you are going well and that the project is progressing well. I'm able to shift my schedule around to have a meeting at any time in the next few weeks (I realise that's probably not super helpful in terms of organising a time), including later in the AEST afternoon if you want to leave it open to a time where non-Aussies can attend. If it's okay, please let me know a date/time when you're all free and I'll ensure I can be there.

It's been a while since I looked at the code, but I will try my best to explain what it's doing and what I did. If you have specific questions about finer details, it would help me a lot if you could post them prior to the meeting so I have time to jog my memory about the specifics.

kym834 commented 3 years ago

@Ecaloota Awesome. Let's do Thursday 26 Nov at 9am. This is 5pm for the east coast US so if anyone would like to join us it's still at a reasonable time and they are more than welcome to.

Zoom details: Time: Nov 26, 2020 09:00 AM Canberra, Melbourne, Sydney Join from PC, Mac, Linux, iOS or Android: https://uni-sydney.zoom.us/j/86868941173

kym834 commented 3 years ago

Hi everyone (@Ecaloota, @narath and @borawl in particular as I think you all have some experience with coding),

Background: I've been working on expanding the database to include all of the essential medicines that are available in Australia. I've got a text file of all the Australian Register of Therapeutic Goods (ARTG) ID numbers, which I created in November last year.

My problem: Since then, some of the medicines in that file have been removed from the register. When I run the code that @Ecaloota created to extract all the information from the webpage and PDFs, I get an error if there is no result for an ARTG ID, e.g. because it has been removed from the register.

My temporary solution: I've been manually removing each ID from the text file as they come up and starting the code from scratch. I've had about 15 out of the first 500 or so ID numbers. With over 6000 IDs it's going to take a while to do this manually with the code starting from the beginning each time.

What I would like to do: Add a piece to the code so that if it searches for an ID and no results are returned, it moves on to the next ID number and continues. I would still like the ID to be in the output file, just with no information in any of the other columns.
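The requested behaviour could be sketched roughly as follows. This is not the existing script: `lookup`, `NoResultError`, and the column names are hypothetical stand-ins for whatever the real code uses to fetch and parse one ARTG entry.

```python
import csv

COLUMNS = ["artg_id", "product_name", "sponsor"]  # hypothetical column names


class NoResultError(Exception):
    """Raised by the (hypothetical) lookup when an ARTG ID returns no entry."""


def extract_all(artg_ids, lookup, out_path):
    """Write one row per ID and return the IDs that had no result.

    IDs whose lookup raises NoResultError still get a row, with every
    column except the ID left blank, and the loop carries on.
    """
    failed = []
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
        writer.writeheader()
        for artg_id in artg_ids:
            try:
                record = lookup(artg_id)
            except NoResultError:
                record = {}  # restval fills the missing columns with ""
                failed.append(artg_id)
            writer.writerow({**record, "artg_id": artg_id})
    return failed
```

Returning the list of failed IDs means they can be reviewed afterwards without restarting the run or editing the ID file by hand.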

Here is the python script that @Ecaloota wrote for this task.

Any help and suggestions on how to do this would be greatly appreciated!

Ecaloota commented 3 years ago

Hi @kym834, I can do this now - could you send me the ARTG_IDs file you're using so I can see some of the errors?

kym834 commented 3 years ago

Thanks! I don't think the first one pops up until the 400th or so ID, but ID 169622 is one of the ones that gives me the error I'm talking about, if that helps.

03022021_ATRG-IDs_Cleaned.txt

Ecaloota commented 3 years ago

Hi @kym834. Good news, I've made some significant changes to the structure of the program, and it looks to be running fine (I've tested approx. the first 2500 IDs). Among the most pertinent features:

I've introduced a data.py file, which is just a file containing a dictionary for associating words found in the product name with probable routes and dosages. You'll need to have this new file in the same directory as the new ARTG_extractinator.py and the ARTG ID file. It exists only to improve readability, and so you can customise it easily if anything not found in that dictionary comes up.
I've changed the logic around a little bit so the output CSV file is written out as the search occurs - this way, if the script crashes for whatever reason, you only need to continue the search from the last-successful ID (though I'd recommend saving the old CSV under a new name and combining them manually - I haven't tested what happens if you try and operate on the original file after a crash - it'd probably be okay, but who knows). I should have done that from the start.
As you requested, I've added in some code to handle the situation where searching for a valid ARTG ID returns a page containing no results. This is a "soft check", so it's only looking for the phrase "0 fully matching" on the results page and I expect any minor changes to the database would render this check useless. But for now, it works, and writes a series of "None" strings to the output associated with that ID.
General code clean-up including docstrings.
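The write-as-you-go logic and the "soft check" described above might look roughly like the sketch below. It is not taken from ARTG_extractinator.py: `fetch_page` and `parse_page` are hypothetical callables, and resuming by appending to the same file is an assumption that differs from the save-and-combine approach recommended above. The check uses a regular expression rather than a plain substring test so that a phrase such as "10 fully matching" would not be mistaken for zero results.

```python
import csv
import os
import re

# Phrase quoted in the comment above; the lookbehind rejects "10 fully matching".
NO_RESULT_PATTERN = re.compile(r"(?<!\d)0 fully matching")


def has_no_results(page_text):
    """Soft check: True if the results page reports zero matches."""
    return NO_RESULT_PATTERN.search(page_text) is not None


def already_done(csv_path):
    """Return the IDs already present in the first column of the output file."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[0] for row in csv.reader(handle) if row}


def run(artg_ids, fetch_page, parse_page, csv_path, n_fields):
    """Append one row per ID as soon as it is processed, skipping finished IDs."""
    done = already_done(csv_path)
    for artg_id in artg_ids:
        if artg_id in done:
            continue
        page_text = fetch_page(artg_id)
        if has_no_results(page_text):
            fields = ["None"] * n_fields
        else:
            fields = parse_page(page_text)
        # Re-opening in append mode per row means a crash loses at most one row.
        with open(csv_path, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow([artg_id, *fields])
```

Calling `run` again with the same ID list and output path after a crash would continue from the first ID that has no row yet.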

Updated script package is here.

kym834 commented 3 years ago

Awesome, thank you @Ecaloota! It worked, though when it came to an ID that returned no results it did not enter any "None" values into the output file. No matter though. We only need to know which ones they are; they do not need to be included in the database. Once it was finished, I went through and identified them by checking the output against the list of IDs.
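That check of the output against the ID list can itself be scripted. A minimal sketch, assuming the ID file has one ID per line and the output CSV has a header row with the ID in the first column (both are assumptions about the file layout, and `missing_ids` is a hypothetical name):

```python
import csv


def missing_ids(id_list_path, output_csv_path, id_column=0, has_header=True):
    """Return the requested IDs that have no row in the output CSV."""
    with open(id_list_path, encoding="utf-8") as handle:
        requested = [line.strip() for line in handle if line.strip()]
    with open(output_csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if has_header:
        rows = rows[1:]
    found = {row[id_column].strip() for row in rows if len(row) > id_column}
    # Preserve the order of the original ID list.
    return [artg_id for artg_id in requested if artg_id not in found]
```

The returned list is the set of IDs that produced no output, for example because they have been removed from the register.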

Ecaloota commented 3 years ago

Hi all,

I'm slowly working away at rewriting the original extraction code in a different way. I'm doing this for a few reasons including fun, but also because I found a better way to extract information from PDF documents, and because I found that the ARTG publishes so-called "Public Summary Documents" of each medicine in the ARTG register.

These documents look to be computer-generated (or, they follow a consistent format over the timeframe of the register), and so I can extract useful information automatically from them, including effective dates, start dates, sponsors, status, ingredients and strengths, dosage forms, etc. Importantly, I can collect this information with fewer assumptions than the last bit of code I wrote.

This last point brings me to an interesting point: last time I wrote this script, I assumed that the "Product name" (i.e. "ABACAVIR/LAMIVUDINE 600/300 SUN abacavir 600 mg / lamivudine 300 mg tablets bottle" for ARTG ID: 296381) contained the necessary information about medicine strengths (in this case, 600 mg abacavir, and 300 mg lamivudine).

However, reading the Public Summary document for the example ID shows that the strength of abacavir sulfate is 702.78 mg, with another line noting that this is equivalent to the 600 mg of abacavir given in the Product Name. It would be easier to pull out strengths from the Product Names, which is what I was doing before, but I'm not sure which is more correct for your purposes? Also, the number we choose will likely affect detection of generic forms, if that is still something you're interested in doing, as there are some medicines which say "1 mg" in the Product Name but actually contain, say, 1.048 mg of the active ingredient in the PDF.
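To make the two options concrete, here is a rough sketch of pulling strengths out of a Product Name and comparing one against a Public Summary value. The regular expression is an assumption based only on the example above (lowercase single-word ingredient, number, unit), and the 5% tolerance is an arbitrary illustrative choice, not a project criterion:

```python
import math
import re

# Lowercase ingredient word, a number, then a unit, e.g. "abacavir 600 mg".
STRENGTH_PATTERN = re.compile(r"\b([a-z]+) (\d+(?:\.\d+)?) ?(mg|g|microgram)\b")


def strengths_from_name(product_name):
    """Map each ingredient found in the Product Name to (amount, unit)."""
    return {
        name: (float(amount), unit)
        for name, amount, unit in STRENGTH_PATTERN.findall(product_name)
    }


def strengths_agree(name_amount, summary_amount, rel_tol=0.05):
    """True if the two amounts differ by no more than rel_tol (relative)."""
    return math.isclose(name_amount, summary_amount, rel_tol=rel_tol)


product_name = (
    "ABACAVIR/LAMIVUDINE 600/300 SUN abacavir 600 mg / "
    "lamivudine 300 mg tablets bottle"
)
parsed = strengths_from_name(product_name)
```

With this tolerance, the 1 mg versus 1.048 mg case would count as a match, while 600 mg of abacavir versus 702.78 mg of abacavir sulfate would not, so salt and base strengths would need to be reconciled before any tolerance-based comparison of generics.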

Cheers, Mitch

TheBreakingGoodProject / Essential-Medicines

Automated Extraction of Stage II Data, Australia #11