Data scraping help - Githubissues

RoseRelevo commented 5 years ago

I am trying to automate the second half of the data abstraction for the institution and funding support of the included NLP studies. I've fiddled with bibliographic reference managers enough to know that I can't manage this with just an output style.

I've identified the Medline record fields I'm interested in as well as documentation for how these fields are populated. I've also got an example of a simple version of the fully abstracted data.

Do you know who would be able to help me with this project? Not only will this save me from a bunch of cutting and pasting in Google Sheets, but we should also be able to re-use this process to abstract the same data from any set of articles.

williamhersh commented 5 years ago

It sounds like you might need a programmer, but I am not sure who is able to help you.

eichmann commented 5 years ago

We (Iowa) maintain a full relational form of MEDLINE. We also have NIH award parsing logic to handle the remarkably variable ways in which authors cite their grants.

RoseRelevo commented 5 years ago

I think the standardized data in the [GR] field is sufficient, as I'm just looking to translate the two-letter codes they use. Do you have something that pulls info from that field and uses the translation tables to get back the actual name of the grantor?

Rose

On Sun, Feb 24, 2019 at 7:52 PM Dave Eichmann notifications@github.com wrote:

We (Iowa) maintain a full relational form of MEDLINE. We also have NIH award parsing logic to handle the remarkably variable ways in which authors cite their grants.
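
The [GR] translation asked about above could be sketched roughly as follows. This is a minimal illustration, not Iowa's parsing logic: it assumes each [GR] value has the form `grant id/code/agency/country`, and `GRANT_CODES` and `parse_gr` are hypothetical names with a deliberately partial lookup table.

```python
# Partial, illustrative lookup table (assumption: extend it from NLM's
# published list of grant codes before relying on it).
GRANT_CODES = {
    "CA": ("NCI", "National Cancer Institute"),
    "HL": ("NHLBI", "National Heart, Lung, and Blood Institute"),
    "LM": ("NLM", "National Library of Medicine"),
    "MH": ("NIMH", "National Institute of Mental Health"),
}


def parse_gr(value):
    """Split one [GR] value, assumed to look like 'grant id/code/agency/country'."""
    # Split from the right so a slash inside the grant id does not shift fields.
    parts = [p.strip() for p in value.rsplit("/", 3)]
    if len(parts) != 4:
        raise ValueError(f"unexpected [GR] value: {value!r}")
    grant_id, code, agency, country = parts
    # Fall back to the agency text when the code is not in the table.
    acronym, name = GRANT_CODES.get(code, ("", agency))
    return {
        "grant_id": grant_id,
        "code": code,
        "acronym": acronym,
        "name": name,
        "country": country,
    }


# Made-up grant number, for illustration only.
row = parse_gr("R01 CA000000/CA/NCI NIH HHS/United States")
print(row["acronym"], row["name"], row["country"], sep="\t")
```

Codes missing from the table fall back to the agency text already present in the field, so non-US funders still come back with a name and a country.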

alexisgraves commented 5 years ago

Hi Rose,

Alexis here from Iowa. Dave @eichmann wanted me to get in touch to get some more details on what you are looking for. Are you envisioning providing a list of PMIDs and getting back the most specific institutional acronym and full name for each grant associated with the article? For example:

PMID	Grant ID	Grant Abbr.	Acronym	Name
185xxxxx	Txx CA00xxxx	CA	NCI	National Cancer Institute

We may also be able to work with other article identifiers, and can tailor the format and style of the data output to your needs.

Let me know how we can help! Alexis

RoseRelevo commented 5 years ago

That is exactly the sort of thing I'm looking for. To be very specific, if I were to provide a list of PMIDs, I would like to extract data from the following fields.

From the grant ID information, we just need the name of the funder, as well as any country information for non-US funders.

From the [AD] field we'd like to pull:

- E-mail
- Top level domain of e-mail
- Department (deduplicate)
- Institution (deduplicate)
- State (deduplicate)
- Country (deduplicate)
- Any mismatch between Country and Top level domain of e-mail
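
A rough sketch of the e-mail part of the [AD] request, assuming the affiliation is available as a plain string. The regular expression, the partial `TLD_COUNTRY` table, and all function names are illustrative assumptions; splitting out department, institution, and state would need more careful parsing than is shown here.

```python
import re

# Simple pattern for e-mail addresses embedded in affiliation text.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Partial, illustrative map from country-code TLD to country name.
TLD_COUNTRY = {
    "ca": "Canada",
    "de": "Germany",
    "jp": "Japan",
    "uk": "United Kingdom",
}


def emails_and_tlds(affiliation):
    """Return (e-mail, top level domain) pairs found in an [AD] string."""
    pairs = []
    for email in EMAIL_RE.findall(affiliation):
        tld = email.rsplit(".", 1)[-1].lower()
        pairs.append((email, tld))
    return pairs


def dedupe(values):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(values))


def country_mismatch(affiliation, tld):
    """True when the TLD maps to a country that the [AD] text never mentions."""
    expected = TLD_COUNTRY.get(tld)
    if expected is None:
        return False  # generic or unmapped TLD (.com, .edu, ...): nothing to compare
    return expected.lower() not in affiliation.lower()


# Made-up affiliation strings, for illustration only.
ad_ok = "Department of Example Studies, Example University, London, United Kingdom. jane@example.ac.uk"
ad_odd = "Example Institute, Berlin, Germany. jane@example.ac.uk"

for ad in (ad_ok, ad_odd):
    for email, tld in emails_and_tlds(ad):
        print(email, tld, country_mismatch(ad, tld), sep="\t")
```

The mismatch check is a plain substring test, so country name variants such as "UK" versus "United Kingdom" would need to be normalized first.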

And we'd like to pull from the MeSH indexing:

- Publication Type Grant (PT), any that are indexed with:
  - Research Support, Non-U.S. Gov't [Publication Type]
  - Research Support, U.S. Government [Publication Type]
  - Research Support, American Recovery and Reinvestment Act [Publication Type]
  - Research Support, U.S. Gov't, Non-P.H.S. [Publication Type]
  - Research Support, U.S. Gov't, P.H.S. [Publication Type]
  - Research Support, N.I.H., Extramural [Publication Type]
  - Research Support, N.I.H., Intramural [Publication Type]

We would like to extract:

- Any mismatch between country data from ([ad] OR [gr]) and [pt] designation
- Any mismatch between [gr] and [pt] Research Support category, articles will be indexed at narrowest level of hierarchy
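
One way the [gr] versus [pt] comparison could be approached is sketched below. The two rules are simplifying assumptions made for illustration (an NIH agency in [GR] should come with an N.I.H. publication type, and a non-US funder should come with the Non-U.S. Gov't type); `gr_pt_mismatches` is a hypothetical helper, and anything it returns is a flag for manual review rather than a confirmed indexing error.

```python
NIH_PTS = {
    "Research Support, N.I.H., Extramural",
    "Research Support, N.I.H., Intramural",
}
NON_US_PT = "Research Support, Non-U.S. Gov't"


def gr_pt_mismatches(grants, publication_types):
    """Flag disagreements between [GR] entries and Research Support [PT] terms.

    grants: list of (agency, country) tuples taken from the [GR] field.
    publication_types: list of [PT] strings for the same article.
    """
    pts = set(publication_types)
    flags = []

    # Assumed rule 1: an NIH agency in [GR] implies an N.I.H. publication type.
    has_nih = any("NIH" in agency.split() for agency, _ in grants)
    if has_nih and not (pts & NIH_PTS):
        flags.append("NIH grant in [GR] but no N.I.H. publication type")

    # Assumed rule 2: a non-US funder in [GR] implies the Non-U.S. Gov't type.
    has_non_us = any(country and country != "United States" for _, country in grants)
    if has_non_us and NON_US_PT not in pts:
        flags.append("non-US funder in [GR] but no Non-U.S. Gov't publication type")

    return flags


# Made-up records, for illustration only.
consistent = gr_pt_mismatches(
    [("NCI NIH HHS", "United States")],
    ["Journal Article", "Research Support, N.I.H., Extramural"],
)
flagged = gr_pt_mismatches(
    [("Wellcome Trust", "United Kingdom")],
    ["Journal Article"],
)
print(consistent)
print(flagged)
```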

Is this something you could do? Reference management software can't get this granular.

Thanks for any help!

alexisgraves commented 5 years ago

@RoseRelevo - This all sounds doable on our end! I can touch base through email to discuss specifics.

RoseRelevo commented 5 years ago

For the purposes of this review, it is quick enough for me to do it by hand. However, I would like to revisit this as a standalone project to be able to quickly visualize funding and team composition given a list of PMIDs.

data2health / nlp-review

Data scraping help #11