In case people want to comment using GitHub instead of docs:

Rare cancer idea from Will... sorry it’s a bit long:

Background: This idea comes from talking to Josh Sommer of the Chordoma Foundation. Josh is starting a new foundation called the Rare Cancer Research Foundation. Rare cancers, in aggregate, are the fourth leading cause of death in the United States. However, the survivability of rare cancers has flatlined (or in some cases decreased) in the past decade, unlike that of more mainstream cancers such as lung, breast or prostate.

Why is this happening? Well, one of the main reasons is that fewer researchers are tackling cancers that only affect a couple thousand people per year. And, Josh thinks, these researchers don’t talk to each other, so there’s considerable inefficiency in the research process.

Josh started the Chordoma Foundation, which has networked researchers, given them easier access to cell lines, raised funding for research, and generally catalyzed research in the field. The results have been great (find here)... launching the first Chordoma biobank, creating international research workshops, helping start new research projects in 10 labs [there was 1 lab working on Chordoma previously] and, most excitingly, sponsoring the lab that found a genetic variant strongly associated with Chordoma. You can generally think about the Chordoma Foundation as a “quarterback” that helps guide and manage the research … networking individual researchers, giving them access to the tools they need, and helping them move as fast as possible. So, Josh is trying to scale up the Chordoma success to all rare cancers.

Problem: Josh’s question: is there a technology that could help with scaling up? They’re currently planning to hire PhDs to act as these quarterbacks that go between each of the labs. These quarterbacks will first create a giant spreadsheet listing (1) all of the investigators working on a given cancer, (2) all of the reagents they’re using, (3) their contact information, and (4) all of their publications. They’ll use this as the base from which they’ll start building out the network.

Solution: We brainstormed some ideas about how to speed up their process, possibly by grepping the state of the research on particular types of cancer from the web. For example, for ocular melanoma, we could build something to grep through all abstracts on the web to figure out (1) the researchers that work on ocular melanoma, (2) the reagents they use (example for Chordoma), and (3) their home university. Then, using Mechanical Turk, we could have people search for the contact information for each of these researchers. The goal would be to come up with some kind of “research graph” for quarterbacks to share with the researchers. For comparison, this type of information took the Chordoma Foundation hundreds of man-hours to build out... just for one rare cancer.

I chatted with my buddy Adrien Treuille (website) about this yesterday. He said he has thought about the same thing... but took the idea one step further. Adrien wanted to graph all of the chemical reactions that are being tested within each paper: basically a graph of all the chemical pathways that are being tested. The hope is that by seeing the overall graph, one might be able to find gaps in the current research or perhaps see new pathways that haven't been tried yet.

And as I was writing this - Josh sent me the following email:

Goals: 1) generate list of all investigators who have published on a given topic, 2) for each investigator, find contact information (email, phone, URL, etc.) 3) secondarily, count the number of publications each author has in the field, draw social network graph of all co-author relationships

Rationale: mapping the participants of a research field is critical for coordinating research within that field, facilitating collaboration and information flow among investigators, and increasing density of the social network within the field
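Goal 3 above (publication counts plus a co-author graph) can be sketched in a few lines of Python. This is only a minimal sketch, not anything Josh or the foundation has built: the function name `coauthor_graph` is made up for illustration, and it assumes author names have already been normalized so that the same person always appears under the same string. The sample paper uses the author list from the PubMed example quoted further down in this email.

```python
from collections import Counter
from itertools import combinations


def coauthor_graph(papers):
    """Count publications per author and co-authorships per author pair.

    papers: list of author-name lists, one list per publication.
    Returns (pub_counts, edge_weights); edge keys are sorted name pairs.
    """
    pub_counts = Counter()
    edge_weights = Counter()
    for authors in papers:
        unique = sorted(set(authors))  # guard against duplicated names
        pub_counts.update(unique)
        for a, b in combinations(unique, 2):
            edge_weights[(a, b)] += 1  # one more shared paper for this pair
    return pub_counts, edge_weights


# Author list from the PubMed article used as the example in this email.
papers = [
    ["Yeom KW", "Lober RM", "Mobley BC", "Harsh G", "Vogel H",
     "Allagio R", "Pearson M", "Edwards MS", "Fischbein NJ"],
]
pub_counts, edge_weights = coauthor_graph(papers)
print(len(pub_counts), "authors,", len(edge_weights), "co-author links")
```

The edge weights could then be handed to whatever drawing tool gets picked; nothing here depends on a particular graph library.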

Example methods for finding contact info: Random example article in pubmed: http://www.ncbi.nlm.nih.gov/pubmed/23124635

Notice 9 coauthors from 3 different institutions: Stanford, Vanderbilt, University of Padova. Yeom KW, Lober RM, Mobley BC, Harsh G, Vogel H, Allagio R, Pearson M, Edwards MS, Fischbein NJ.

Author: Yeom
Search: yeom AND "stanford" OR "vanderbilt" OR "university of padova"
First hit is http://med.stanford.edu/profiles/radiology/researcher/Kristen_Yeom/
Search "@" on page source yields 7 hits, 6th is emailE=('kyeom'+'@'+'stanford.edu')

Author: Lober
Search: lober AND "stanford" OR "vanderbilt" OR "university of padova"
First hit is http://dura.stanford.edu/RobertLober.html
Search "@" on page source yields 1 hit, not Lober
Search "lober" on page source yields 6 hits, 5th is Robert Lober [roblober gmail.com]

Author: Mobley
Search: mobley AND "stanford" OR "vanderbilt" OR "university of padova"
First hit is: http://dsresearch.stanford.edu/about/mobley.html - wrong person
Second hit is: http://www.mc.vanderbilt.edu/root/vumc.php?site=vmcpathology&doc=30874 - correct person
Search "@" on page source yields 4 hits, 1 and 2 are
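The manual steps in this example (build a search string, then look for "@" in the page source) could be automated roughly as follows. This is a hedged sketch: `build_query` and `extract_emails` are hypothetical helper names, the parentheses around the OR group are an addition not present in the searches above, and the second regex only targets the one JavaScript-split form seen on the Yeom page. Addresses written without an "@" (as on the Lober page) would still need manual review.

```python
import re

# Ordinary addresses such as name@example.org.
PLAIN_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Split form seen in the example: ('user'+'@'+'domain.tld')
SPLIT_EMAIL = re.compile(
    r"'([A-Za-z0-9._%+-]+)'\s*\+\s*'@'\s*\+\s*'([A-Za-z0-9.-]+\.[A-Za-z]{2,})'"
)


def build_query(last_name, institutions):
    """Build a search string: last name AND any of the institutions."""
    quoted = " OR ".join(f'"{name}"' for name in institutions)
    return f"{last_name} AND ({quoted})"


def extract_emails(page_source):
    """Return email addresses found in page source, in order, deduplicated."""
    found = PLAIN_EMAIL.findall(page_source)
    found += [f"{user}@{domain}"
              for user, domain in SPLIT_EMAIL.findall(page_source)]
    return list(dict.fromkeys(found))  # dict keys keep first-seen order


query = build_query("yeom", ["stanford", "vanderbilt", "university of padova"])
emails = extract_emails("emailE=('kyeom'+'@'+'stanford.edu')")
print(query)
print(emails)
```

Matching a found address against the author's last name and the institution's domain, as the email suggests, would be a natural next filter before manual verification.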

Summary of method:
search pubmed for disease
for each article:
    download author list
    download affiliation list
    parse institutions from affiliation list text
        possibly find list of all research institutions for comparison
For each author
    search last name AND all institutions
    store URLs for top 10? hits
    For each URL
        scrape contact info:
            First name
            Last name
            Phone numbers
            emails
                search for @, last name, institution domain
                Or use pre-built scraper?
            mailing addresses
            departments
    List found contact info for each author
    Manually verify
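The first steps of this method (search PubMed, then download authors and affiliations) map onto NCBI's public E-utilities endpoints, `esearch.fcgi` and `efetch.fcgi`. The sketch below only builds the request URLs and parses a returned XML document; it makes no network calls itself. The helper names are hypothetical, and the parser assumes the current PubMed XML layout, where each `Author` element carries `LastName`, `ForeName` and nested `Affiliation` elements. Older records may list an affiliation for the first author only.

```python
import urllib.parse
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=100):
    """URL that searches PubMed for a term and returns matching PMIDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def efetch_url(pmids):
    """URL that downloads full records (as XML) for a list of PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urllib.parse.urlencode(params)}"


def parse_authors(pubmed_xml):
    """Extract one row per named author from an efetch XML response."""
    root = ET.fromstring(pubmed_xml)
    rows = []
    for author in root.iter("Author"):
        last = author.findtext("LastName")
        if last is None:  # group or collective authors have no LastName
            continue
        rows.append({
            "last_name": last,
            "fore_name": author.findtext("ForeName") or "",
            "affiliations": [a.text for a in author.iter("Affiliation") if a.text],
        })
    return rows


print(esearch_url("chordoma", retmax=5))
print(efetch_url(["23124635"]))
```

Fetching the URLs (for example with `urllib.request.urlopen`) and rate-limiting the requests are left out here; parsing institution names out of the free-text affiliation strings is the harder part and is not attempted.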

Mark Laabs, Josh's collaborator on RCRF, adds: I would add that there is an output/visualization side to the puzzle as well. The above will get us the data, but we also need to think about how to make it possible to pull that data out in as simple and effective a way as possible. Clearly part of that is just a multivariable search capability, e.g. "researchers at MD Anderson who have published on chordoma in the last year". We'd then want to be able to email lists of people based on those results. However, an ability to map co-authorship, to visualize nodes of researchers that already know each other so that we could then be intentional about bridging isolated nodes, could be extremely valuable.
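Both pieces Mark describes are small once the data exists. The sketch below is illustrative only: the `Publication` record and its fields are assumptions about how the scraped data might be stored, `search` is a plain filter standing in for the multivariable search, and `components` groups researchers into connected clusters of the co-author graph so that isolated clusters (candidates for bridging) stand out. The names in the usage lines are placeholders, not real researchers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """One (author, paper) record; the field names are hypothetical."""
    author: str
    institution: str
    topic: str
    year: int


def search(pubs, institution=None, topic=None, since_year=None):
    """Authors matching every filter that is given (None means 'any')."""
    hits = set()
    for p in pubs:
        if institution is not None and p.institution != institution:
            continue
        if topic is not None and p.topic != topic:
            continue
        if since_year is not None and p.year < since_year:
            continue
        hits.add(p.author)
    return sorted(hits)


def components(nodes, edges):
    """Connected clusters of the co-author graph, via depth-first search."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        groups.append(sorted(group))
    return groups


pubs = [
    Publication("A", "MD Anderson", "chordoma", 2013),
    Publication("B", "MD Anderson", "chordoma", 2009),
    Publication("C", "Stanford", "chordoma", 2013),
]
print(search(pubs, institution="MD Anderson", topic="chordoma", since_year=2012))
print(components(["A", "B", "C", "D", "E"], [("A", "B"), ("C", "D")]))
```

Any cluster that shares no edge with the rest of the graph is a group the quarterbacks could deliberately introduce to the others.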

From Eric Butter: Hi guys, To keep the project low-cost early on, consider using some APIs that sacrifice a bit of comprehensiveness in favor of quality.
I gave this problem a whack this morning :: there is a programmatic interface here but ... there might be lower-hanging fruit. Mendeley has an easy-to-use, better-structured API than PubMed, but it will be less comprehensive for rare diseases (that research is mostly publicly funded, so it /needs/ to be submitted to PubMed).
Elsevier and other publishers have very well-structured (paid) APIs that are worth considering too.
Sage Bionetworks rocks; they are establishing the field's standards (using Cytoscape, Graphviz, etc.). They have a research collab platform -- I have not used it, but there is a lot of institutional investment in what they are doing!
I would love to join this hackathon -- give me a holler when you have a time/place!

Eric (650-741-5371)