clulab / reach

Reach Biomedical Information Extraction

Beyond PubMed Open Access #468

Open MihaiSurdeanu opened 7 years ago

MihaiSurdeanu commented 7 years ago

We should aim to read open-access papers that are available from other sources beyond PubMed. I suspect most of these will be in PDF format, so @GullyAPCBurns's software becomes very handy.

Ideas for additional sources. From Thea Norman: "After we spoke I was thinking more about search sources. I think we talked about patent databases as a source of papers and findings that could enrich your base for machine reading. One thing I learned in a talk later the same day is that pre-prints (i.e., accelerated publications) probably do not appear in PubMed. As researchers look for more rapid forms of publishing, pre-print services are growing, and there are even groups like the Center for Open Science that aggregate pre-prints published by a number of other providers (https://cos.io/preprints/). Given the growth of pre-prints, I wondered if Big Mechanism is covering this landscape, either via search of a pre-print aggregator like the Center for Open Science or perhaps via Google Scholar."

From Marco: "I don't know if you are aware of this resource. Seems like it could be useful for us: https://core.ac.uk/ It says here (https://core.ac.uk/dataproviders) that they use over 6000 journals: https://core.ac.uk/journals" Mihai's comment: the whole dump of this collection is available for download.

What I suspect this effort entails (a rough sketch of the pipeline follows the list):

  1. Download the papers. @hickst can probably do this.
  2. Convert all into NXML. For this we would need @GullyAPCBurns.
  3. Read all papers. Again @hickst.
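
A rough sketch of how these three steps could hang together. Everything here is a placeholder, not an existing Reach script; `download_pdfs()` stands in for whatever bulk-fetch mechanism a source offers (e.g., the downloadable CORE dump), and `pdf_to_nxml()` stands in for the LAPDFText-based converter:

```python
# Placeholder sketch of the three-step pipeline; every name and path is a
# stand-in for tooling discussed in this thread, not an existing tool.
from pathlib import Path

PDF_DIR = Path("papers/pdf")     # assumed local layout
NXML_DIR = Path("papers/nxml")

def download_pdfs(source: str, out_dir: Path) -> None:
    """Step 1 (stand-in): bulk-download PDFs from a source like CORE."""
    ...

def pdf_to_nxml(pdf: Path, out_dir: Path) -> Path:
    """Step 2 (stand-in): convert one PDF to NXML via LAPDFText tooling."""
    return out_dir / (pdf.stem + ".nxml")

def run_pipeline() -> None:
    PDF_DIR.mkdir(parents=True, exist_ok=True)
    NXML_DIR.mkdir(parents=True, exist_ok=True)
    download_pdfs("https://core.ac.uk/", PDF_DIR)
    for pdf in sorted(PDF_DIR.glob("*.pdf")):
        pdf_to_nxml(pdf, NXML_DIR)
    # Step 3: run Reach over NXML_DIR; the exact invocation depends on the
    # Reach version, so it is deliberately left out here.
```
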
johnbachman commented 7 years ago

This is a great idea. I've been talking to the Harvard library about getting access to a large amount of subscription-based content, which would all be in the form of PDFs.

Other great sources would be the Wikipedia entries for various genes, the open-access biology textbooks available through NCBI, and the overview blurbs on proteins at UniProt. One wrinkle with UniProt is that the sentences all lack a subject, which is implicitly the protein of interest. E.g., the entry on EGFR has sentences like:

"Activates at least 4 major downstream signaling cascades including the RAS-RAF-MEK-ERK, PI3 kinase-AKT, PLCgamma-PKC and STATs modules. May also activate the NF-kappa-B signaling cascade. Also directly phosphorylates other proteins like RGS16, activating its GTPase activity and probably coupling the EGF receptor signaling to the G protein-coupled receptor signaling. Also phosphorylates MUC1 and increases its interaction with SRC and CTNNB1/beta-catenin."

MihaiSurdeanu commented 7 years ago

These are great ideas. We can fix the missing subject: if the syntactic analysis of a sentence finds no subject, we can artificially add one using the protein name.
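
A minimal sketch of that heuristic, purely illustrative: the "no subject" test below is faked with a word list; in practice it would check the dependency parse for a root verb with no nsubj attached.

```python
# Illustrative sketch only (not Reach's actual preprocessing): prepend the
# UniProt entry's protein name to sentences whose analysis finds no
# subject. The crude leading-verb check stands in for a real
# dependency-parse test (root verb with no nsubj/nsubjpass).
LEADING_VERBS = {"Activates", "Phosphorylates", "Binds", "May", "Also"}

def add_implicit_subject(sentence: str, protein: str) -> str:
    first = sentence.split(maxsplit=1)[0]
    if first in LEADING_VERBS:  # stand-in for "parse found no subject"
        return f"{protein} {sentence[0].lower()}{sentence[1:]}"
    return sentence

# add_implicit_subject("Activates at least 4 major downstream signaling
# cascades.", "EGFR")
# -> "EGFR activates at least 4 major downstream signaling cascades."
```
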


GullyAPCBurns commented 7 years ago

What were you guys thinking of in terms of data sources? The system I have is tailored to individual journal formats, so some tweaking is generally needed to get the extraction as good as it can be. Still, it would be interesting to see how this goes.

Gully


hickst commented 7 years ago

To be practical, it seems like the extraction has to be tailored to individual input formats, because PDF mixes so much formatting information in with the text.

MihaiSurdeanu commented 7 years ago

My top 2 choices for PDF papers are preprints (https://cos.io/preprints/) and CORE (https://core.ac.uk/). I hope each uses a standard format throughout (I haven't checked yet). If they do, we need at most 2 more parsers.

GullyAPCBurns commented 7 years ago

LAPDFText actually uses rule files to drive the extraction for each format. You specify what characteristics the narrative text has in a given article and where its boundaries are, and the system can usually do quite a good job of extracting it. I would like to use some machine learning to automatically classify the different blocks of text and read them in order, but I haven't had the bandwidth/expertise to do this easily.

Would that be something we might try? I think a good student could tear through this question pretty quickly.

Gully
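
For what the learned version might look like, here is a toy sketch; the features, labels, and training rows are invented for illustration and have nothing to do with LAPDFText's internals. It just treats block classification as ordinary supervised learning over layout features, then keeps narrative blocks in reading order:

```python
# Toy sketch of the ML idea above (assumed features, not LAPDFText's API).
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-block features: [font_size, x_left, y_top, n_words,
# frac_uppercase]. Labels: 1 = narrative text, 0 = heading/caption/footer.
X_train = [
    [10.0, 72.0, 120.0, 85, 0.03],  # body paragraph
    [ 8.0, 72.0, 740.0, 12, 0.10],  # page footer
    [12.0, 72.0,  60.0,  9, 0.70],  # section heading
    [10.0, 72.0, 300.0, 90, 0.04],  # body paragraph
]
y_train = [1, 0, 0, 1]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

def narrative_blocks(blocks):
    """blocks: [(features, text)] in reading order; yield narrative text."""
    for features, text in blocks:
        if clf.predict([features])[0] == 1:
            yield text
```
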


MihaiSurdeanu commented 7 years ago

I think this is a question for @johnbachman and @bgyori. Would looking at additional data be useful?

johnbachman commented 7 years ago

Catching up on this one. @MihaiSurdeanu, what do you mean by "additional data"? Certainly additional papers would be useful. However, for our current purposes, I'm not sure we need a lot of additional metadata on paper sections, etc. Since we are currently operating on assertions rather than data (for better or worse), we're generally willing to take them from any section of the paper.

Another thing I would add: it would be worth developing a processing pipeline specifically for Elsevier's XML format (I can find the schema and provide examples if interested), because it represents a very large chunk of full-text content. On average, for a typical corpus of PMIDs, we get about 18% as full-text NXML content from PMC OA and PMC author manuscripts combined, and about 11% as XML content from Elsevier. For the rest we're currently just taking abstracts.
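
For when the schema arrives, a first-pass sketch of what such a reader might look like; the ce: namespace URI and element name below are my assumptions about Elsevier's common DTD and should be verified against the schema and examples John mentions:

```python
# Hedged sketch of an Elsevier full-text XML reader. The namespace URI and
# the ce:para element are assumptions about Elsevier's common DTD; verify
# them against the actual schema before building on this.
from lxml import etree

NS = {"ce": "http://www.elsevier.com/xml/common/dtd"}  # assumed URI

def elsevier_paragraphs(path):
    """Yield flattened body paragraphs from one Elsevier XML file."""
    tree = etree.parse(path)
    for para in tree.iterfind(".//ce:para", namespaces=NS):
        # itertext() flattens inline markup (cross-refs, italics, etc.)
        yield "".join(para.itertext()).strip()
```
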

MihaiSurdeanu commented 7 years ago

Thanks @johnbachman! Yes, schema and examples would be nice.