Data4Democracy / drug-spending

Project to understand pharmaceutical spending, currently focused on US government programs.

Scrape Merck Manuals for drug names and uses #50

Closed mattgawarecki closed 7 years ago

mattgawarecki commented 7 years ago

Task

The Merck Manuals website contains a listing of drugs, mapping generic names to brand names and listing usage indications (i.e., what the drug is prescribed for) with each one. We'd like to gather this data to build on our efforts to map drugs to their uses.

Start here: Merck Manuals Professional Version - Drug Information

This issue was spun off from #14.

Things you should know

The Merck Manuals website defaults to its consumer version. To see the professional version, one must select it explicitly. Hotlinks to the professional version redirect to the consumer version unless this selection is done beforehand. This issue can be circumvented by setting the HTTP Referer header to the value http://www.merckmanuals.com/professional.
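As a rough illustration of the Referer workaround, the sketch below builds (but does not send) a request using only Python's standard library. The helper name `professional_request` is made up for this example, and whether the header alone is enough to avoid the redirect is taken from the description above rather than verified here.

```python
from urllib.request import Request

# Referer value quoted in the issue description.
PROFESSIONAL_REFERER = "http://www.merckmanuals.com/professional"


def professional_request(url: str) -> Request:
    """Build a GET request that carries the professional-version Referer.

    Hypothetical helper: it only constructs the request object and
    does not contact the site.
    """
    return Request(url, headers={"Referer": PROFESSIONAL_REFERER})


# Example target: the brand-names page mentioned later in this thread.
req = professional_request(
    "http://www.merckmanuals.com/professional/appendixes/"
    "brand-names-of-some-commonly-used-drugs"
)
print(req.get_header("Referer"))
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out so the sketch stays offline.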

Retrieving usage indicators for a drug may prove more complex than simply getting its name. Usage indicators are contained in a modal pop-up that appears when the user clicks on a drug name. Because the modal is controlled via JavaScript, the markup containing the desired information may not be visible to a basic "naive" scraper. This modal is a definitive guide to the drug in fine-grained detail, so some substantial text parsing may also be necessary.

What we're looking for

Output from this task should be one or more data files (CSV, feather, or otherwise). In this output, the following information should be recorded for each drug: generic name, brand name, and usage indicator(s).
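One possible shape for a CSV output is sketched below. The column names, the `write_drug_rows` helper, and the choice to join multiple usage indicators with "; " are all assumptions made for illustration, and the row shown is placeholder data, not real drug information.

```python
import csv
import io

# Assumed column names; the issue only lists the three pieces of information.
FIELDNAMES = ["generic_name", "brand_name", "usage_indicators"]


def write_drug_rows(rows, handle):
    """Write one CSV line per drug, joining its usage indicators with '; '."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "generic_name": row["generic_name"],
            "brand_name": row["brand_name"],
            "usage_indicators": "; ".join(row["usage_indicators"]),
        })


# Placeholder record, not taken from the Merck Manuals.
example_rows = [
    {
        "generic_name": "example-generic",
        "brand_name": "EXAMPLE-BRAND",
        "usage_indicators": ["first placeholder use", "second placeholder use"],
    }
]

buffer = io.StringIO()
write_drug_rows(example_rows, buffer)
print(buffer.getvalue())
```

A drug with several brand names could be written as several rows or as a joined field; the sketch leaves that decision open.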

How this will help

A robust dataset that correlates drugs with the conditions they're used to treat will prove invaluable as we start to dig into Medicare data. With the detail the Merck Manuals provide, we may be able to provide the clearest picture to date as to trends in Medicare drug spending and create snapshots that show how the Medicare population's health has changed over time.

domingohui commented 7 years ago

I would love to help out on this one! Are we gathering all of the information for the drugs listed here?

mattgawarecki commented 7 years ago

@domingohui I think that's the goal, but I'm not as well acquainted with this task as some others in the project. I think @TBusen is heading up work on Merck Manual, if I recall correctly.

@TBusen, do you see any opportunities to pair with @domingohui on this issue?

TBusen commented 7 years ago

Absolutely! Although, as you indicated, it's not as easy as just sending a GET request and scraping. Here are some lessons learned, where I am so far, and some ideas on what I plan to try next.

1) To even reach the page without a browser (e.g., from Python), you need to set a User-Agent header in your GET call, e.g.:

"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"

If you don't add this, you get a 403 error.

2) Calling the page directly (http://www.merckmanuals.com/professional/appendixes/brand-names-of-some-commonly-used-drugs) results in a redirect to the root page.

a) Because of this, I just downloaded the page's HTML source manually and scraped it. I have completed the table from the link above that maps generic names to brand names and saved it as a CSV.

3) To automate this, I am experimenting with Selenium. I have it launching Firefox and navigating to the top menu. Tonight I'm going to add the click method to see if I can get the browser to go to the correct link and scrape the HTML, then apply my script from 2a. It sounds straightforward, but so far I can't get it to recognize the correct HTML class to click on. Feel free to play with this.

4) The real prize is the JS pop-up window with all the detail. I tried to generate the pop-ups and download them as PDFs, thinking I could use the pdftables_api to convert the PDFs to CSV. I've had good luck with this API in the past, but the site doesn't export the info to the PDFs. I am going to look into screen-scraping methods.

To really automate this we need to be able to navigate from the main page to the drug information page. The site expects a browser so I think Selenium or similar is needed. I'm no expert in this area so any ideas would be warmly welcomed.
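For point 1, a minimal standard-library sketch of sending a browser-like User-Agent might look like the following. The helper names `browser_like_request` and `fetch_status` are invented for this example, and the redirect described in point 2 may still apply even when the 403 is avoided.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# User-Agent string quoted in point 1 above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
)


def browser_like_request(url: str) -> Request:
    """Build a GET request that identifies itself with a browser User-Agent."""
    return Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


def fetch_status(url: str) -> int:
    """Send the request and return the HTTP status code.

    HTTP error responses (such as the 403 mentioned above) are
    returned as codes instead of being raised.
    """
    try:
        with urlopen(browser_like_request(url), timeout=30) as response:
            return response.status
    except HTTPError as err:
        return err.code


# Only the request object is built here; nothing is sent to the site.
request = browser_like_request("http://www.merckmanuals.com/professional")
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

`fetch_status` is defined but never called in the sketch, so running it does not touch the network.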

Some sources that might be useful:

TBusen commented 7 years ago

I was able to navigate to the generic-to-brand-name drug page using Selenium. See my PR for what I did. Now that this is done, the next step is to loop through all the drug name links to scrape the pop-ups.

domingohui commented 7 years ago

Thanks @TBusen, I didn't realize my link wasn't working. Sorry about that! I will experiment with some of the options you suggested.

I don't know if you tried this already, but if you navigate to the pro version -> Drug Information -> Drugs by Name, Generic and Brand and click on a drug name, the website sends a request to http://www.merckmanuals.com/Custom/LexicompMonograph/MonographByName?name=[drug_name]. The response is really ugly looking, but I think that's what populates the pop-up, and most of the information about the drug seems to be in there. So if we have a list of drug names, we can just query them one by one with this link. The main obstacle now is extracting useful information from the response.
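A small sketch of building one such URL per drug name is below. The `monograph_url` helper is hypothetical, the drug name is a placeholder, and the sketch assumes the endpoint accepts standard query-string encoding (spaces as `+`), which has not been checked against the site.

```python
from urllib.parse import urlencode

# Endpoint observed in the browser's network traffic, as described above.
MONOGRAPH_ENDPOINT = (
    "http://www.merckmanuals.com/Custom/LexicompMonograph/MonographByName"
)


def monograph_url(drug_name: str) -> str:
    """Return the monograph URL for a single drug name (hypothetical helper)."""
    query = urlencode({"name": drug_name})
    return MONOGRAPH_ENDPOINT + "?" + query


# Placeholder name; a real run would iterate over the scraped drug list.
for name in ["example drug"]:
    print(monograph_url(name))
```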

TBusen commented 7 years ago

That appears to be the response from the consumer page, not the pro page. The two pages seem to return different information for the same drug. I see in the network inspector where that is coming from, and you're right that the output is ugly, but I don't think we want the consumer view.

TBusen commented 7 years ago

Good news: I think we will be able to get the data from the pop-ups. I was able to pull it in by locating the first table element, clicking it, and then telling Selenium to look for an element that isn't visible in the body (the class name of the pop-up's main body is lexi-main) and find all paragraph elements inside it. I haven't looped through the table yet since I didn't use the drug name. If you want to pick up from my latest commit, the browser code ends with:

chrome.find_element_by_link_text('Drugs by Name, Generic and Brand').click()

That generates the table of drug names. @domingohui I'll commit what I have if you want to try and get this to work for drugs in the table taking drug name as an input to the loop.
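The paragraph-collection step can also be sketched without a browser, assuming the pop-up's rendered HTML has already been captured as a string (for instance from Selenium's `page_source`). Only the `lexi-main` class name comes from this thread; the sample markup, the `LexiParagraphs` class, and the `extract_paragraphs` helper are made up for illustration.

```python
from html.parser import HTMLParser


class LexiParagraphs(HTMLParser):
    """Collect the text of <p> elements nested inside a 'lexi-main' element."""

    def __init__(self):
        super().__init__()
        self.container = None  # tag name of the lexi-main element, once seen
        self.depth = 0         # nesting depth of that tag name inside it
        self.in_p = False
        self.buf = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.container is None:
            if "lexi-main" in classes:
                self.container, self.depth = tag, 1
            return
        if tag == self.container:
            self.depth += 1
        if tag == "p":
            self.in_p, self.buf = True, []

    def handle_endtag(self, tag):
        if self.container is None:
            return
        if tag == "p" and self.in_p:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self.buf).split())
            if text:
                self.paragraphs.append(text)
            self.in_p = False
        if tag == self.container:
            self.depth -= 1
            if self.depth == 0:
                self.container, self.in_p = None, False

    def handle_data(self, data):
        if self.container is not None and self.in_p:
            self.buf.append(data)


def extract_paragraphs(html: str) -> list:
    parser = LexiParagraphs()
    parser.feed(html)
    return parser.paragraphs


# Made-up markup standing in for a captured pop-up.
SAMPLE = """
<div class="header"><p>Not part of the pop-up</p></div>
<div class="lexi-main popup">
  <p>Use: <b>placeholder</b> indication text</p>
  <div><p>Second   paragraph</p></div>
</div>
<p>Outside again</p>
"""
print(extract_paragraphs(SAMPLE))
```

Real monograph markup is likely messier than this sample, so picking out the usage indicators from the collected paragraphs would still need further parsing.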

mattgawarecki commented 7 years ago

Hey all,

I just got a reply from Merck regarding our permission to use this data -- it's not great. :cry: See my comment on #14.

In light of Merck's respectful declination, should we continue to keep this issue open?

jenniferthompson commented 7 years ago

Under the circumstances, probably best to close it. :cry:

mattgawarecki commented 7 years ago

Closing.