HazyResearch / dd-genomics

The Genomics DeepDive project
Apache License 2.0

Adapt Parser/Schema to PubMed #76

Colossus closed 9 years ago

Colossus commented 9 years ago

The parser needs to be adapted anyway. It should also output the publication date, since we don't really want stuff from before 2000.

Additionally, the key of our sentences and sentences_input tables is currently based on the DOI of the publication. Because the DOI effectively contains a ton of other information, there are lots of dependencies on the structure of this primary key. However, we don't have a DOI for the PubMed articles. We have --- wait for it --- a PubMed ID (surprise!). So a number of changes in the schema and around the codebase are going to be necessary, in particular if we want to keep the option of having both PMC and PubMed in the database at the same time (which we do).
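One way to keep PMC and PubMed documents side by side would be to stop relying on the DOI's internal structure and prefix each native identifier with its source. A minimal Python sketch of that idea; the `DocKey` name, the `source:id` layout, and the source labels are assumptions for illustration, not what the codebase currently does:

```python
from typing import NamedTuple

# Hypothetical source labels; the actual schema may name these differently.
KNOWN_SOURCES = {"pmc", "pubmed"}


class DocKey(NamedTuple):
    source: str     # which corpus the document came from
    native_id: str  # e.g. a DOI for PMC articles, a PMID for PubMed abstracts

    def encode(self) -> str:
        # Only the first ':' is significant, so identifiers containing ':' survive.
        return f"{self.source}:{self.native_id}"

    @classmethod
    def decode(cls, key: str) -> "DocKey":
        source, sep, native_id = key.partition(":")
        if not sep or source not in KNOWN_SOURCES or not native_id:
            raise ValueError(f"not a source-qualified document key: {key!r}")
        return cls(source, native_id)
```

Code that today parses the DOI out of the primary key would instead call `DocKey.decode` and branch on `source`, so a bare PMID can never be mistaken for a DOI.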

Colossus commented 9 years ago

In particular, the MeSH-DOI-Pubmed pheno supervision comes to mind.

gbgbg commented 9 years ago

As it happens, I am playing with NCBI's eutils efetch. On a sample size of n=1: if you ask for the document as text you get the PMID and PMCID, but if you ask for it as XML you also get the DOI. No idea how general this is... Does it extend back to 2000 (DOIs are newer, I think)? I'm not even sure in what format @Colossus got PubMed.

tail foo1 foo2

==> foo1 <==
2397682 10.1167/iovs.15-16739 26193921 PMC4509060
</PubmedData>

==> foo2 <==
activation of the mitogen-activated protein kinase (MAPK) pathway, stiffens the ECM in vitro along with upregulation of Wnt antagonists and fibrotic markers embedded in a more organized matrix, and increases the stiffness of TM tissues in vivo. These results demonstrate glucocorticoid treatment can initiate the biophysical alteration associated with increased resistance to aqueous humor outflow and the resultant increase in IOP.

PMCID: PMC4509060 [Available on 2016-01-01] PMID: 26193921 [PubMed - in process]
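For anyone wanting to check how often the XML form carries a DOI, here is a rough Python sketch that builds an efetch URL and pulls the identifiers out of a record. The helper names are made up for illustration, and the sample fragment is hand-written: pairing each value with an `IdType` is a guess based on the order of the values in the output above, not a copy of the real response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid: str) -> str:
    """Build an efetch request URL for one PubMed record in XML."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def article_ids(pubmed_xml: str) -> dict:
    """Map each IdType in the record's own ArticleIdList to its value."""
    root = ET.fromstring(pubmed_xml)
    # Restrict to the list directly under PubmedData so that identifiers
    # of cited references are not picked up.
    id_list = root.find(".//PubmedData/ArticleIdList")
    ids = {}
    if id_list is None:
        return ids
    for node in id_list.findall("ArticleId"):
        id_type = node.get("IdType")
        if id_type and node.text:
            ids[id_type] = node.text.strip()
    return ids


# Hand-written sample; the IdType-to-value pairing is an assumption.
SAMPLE = """
<PubmedArticleSet><PubmedArticle><PubmedData><ArticleIdList>
  <ArticleId IdType="pii">2397682</ArticleId>
  <ArticleId IdType="doi">10.1167/iovs.15-16739</ArticleId>
  <ArticleId IdType="pubmed">26193921</ArticleId>
  <ArticleId IdType="pmc">PMC4509060</ArticleId>
</ArticleIdList></PubmedData></PubmedArticle></PubmedArticleSet>
"""

ids = article_ids(SAMPLE)
has_doi = "doi" in ids  # records without a DOI simply lack this key
```

Running `article_ids` over a sample of pre- and post-2000 records and counting how many have a `doi` key would answer the "how general is this" question empirically.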

Colossus commented 9 years ago

@gbgbg Yeah that's what I don't know. First, I don't have a PubMed to DOI mapping. (We do have a similar mapping file, but only for PLoS PubMed IDs to DOI.) Second, I already looked for a giant PubMed to DOI file, and couldn't find it. Third, even if I found it, there's always stuff that doesn't map for some reason.

hguturu commented 9 years ago

Why did you not want publications before 2000?

Colossus commented 9 years ago

@hguturu Gill suspects that publications before 2000 don't contain enough valuable gene-phenotype relationships. I tend to agree, but what's your view?

hguturu commented 9 years ago

What is the intuition behind this idea? Additionally, unless they are going to cause problems, why ignore them?

Even if the number of gene-phenotype relationships extracted per CPU cycle drops pre-2000, it doesn't seem necessary to ignore them unless you are extremely CPU-starved.

Colossus commented 9 years ago

@hguturu You're probably mostly right. I guess we first have to find out how CPU-starved we actually are. I think one goal is to be able to let the thing run through each night, since we don't want to block raiders during the day. If it turns out we actually have enough CPU and memory, then we can just process the full thing. Or select journals that we want to process rather than doing a year-based cutoff.

gbgbg commented 9 years ago

The intuition began with full-body analysis: I imagined that all worthwhile facts from before, say, 2000 will be repeated over and over in the more modern literature. And since processing felt CPU-heavy to me, why bother?

Even with abstracts: there are a number of (mostly manual) curation efforts out there. I would think they got all the (far fewer) facts from >15 years ago, and we typically seed our learning on these curations. I doubt there are many Mendel-like papers buried in the old literature. I suspect, rather, that inferior technology led folks to a lot of associations that did not pan out, and to a larger fraction of erroneous calls than today.

Also ask yourself the following: what was the last exome you solved where a good enough (if not indeed your best) reference (or, harder on me: abstract) could not be found post-1999? Those would help sway me.

This is only my intuition. I can phrase a softer recommendation: given our CPU limitations, start from the latest papers / abstracts and work your way back. If we can afford to read earlier stuff, I would turn a careful eye to solid facts that were, say, only mentioned in the nineties but not since.
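The "latest first, work backwards" idea can be sketched as a budgeted loop. This is only an illustration in Python: the `(doc_id, year)` record shape, the function name, and the time budget are all hypothetical, and the real pipeline would plug in its own processing step.

```python
import time
from typing import Callable, Iterable, List, Tuple

Doc = Tuple[str, int]  # (document id, publication year); hypothetical shape


def process_newest_first(
    docs: Iterable[Doc],
    process: Callable[[str], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Process documents newest-first until the time budget is spent.

    Returns the ids that were not reached, so a later run can resume there.
    """
    deadline = clock() + budget_seconds
    ordered = sorted(docs, key=lambda d: d[1], reverse=True)
    for i, (doc_id, _year) in enumerate(ordered):
        # The budget is checked before each document, never mid-document.
        if clock() >= deadline:
            return [d[0] for d in ordered[i:]]
        process(doc_id)
    return []
```

With a nightly budget, the pre-2000 material is only touched once everything newer is done, which avoids committing to a hard year cutoff up front.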

I will be the first to admit this is 100% intuition at this point (which I hope makes it even more interesting for infolab to research).

-Gill

gbgbg commented 9 years ago

I, too, suspect PMID is the way to go. Not sure we can avoid bulk / cron-scheduled mapping to it from DOIs and (worse) titles.

PMIDs, by the way, are a bit fickle: they sometimes change after initial processing. I can see it in PubCrawler reports read with some lag. However, that seems to be only a small fraction.

We should remember genomics is messy and aim for a 90% solution over a 70% solution, but not bother even trying for 100%. Chris and Alex will likely argue the beast is smart enough as it is, but that is no reason not to make life easier on it for a reasonable investment.

-Gill