Manually add a handful of articles that solve patients we have already solved

gbgbg commented 9 years ago

A+H have a set of papers that conclusively solve some of our patients. Pick a subset of these (five?) and see whether we have already crawled them. If not, put them into a crawlable format. Crawl them and run them through the DDG process to check that they yield the expected jackpots all the way at the end of the A+H process.

amwenger commented 9 years ago

Here are five examples of variants that we confirmed in published literature:

PMID 25724810: KMT2A:c.3464G>A in a patient with Wiedemann-Steiner syndrome.
PMID 24686847: IFIH1:c.2159G>A in a patient with Aicardi-Goutieres syndrome 7.
PMID 25394726: NOTCH3:c.6692_93insC in a patient with lateral meningocele syndrome.
PMID 25519458: FLCN:c.57_58del in a patient with Birt-Hogg-Dube syndrome.
PMID 23951358: GJA1:c.716G>A in a patient with craniometaphyseal dysplasia.
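
For checking these against pipeline output, a minimal sketch follows. It assumes the extracted results can be reduced to (PMID, gene) pairs; the names `PositiveControl`, `CONTROLS`, and `missing_controls` are hypothetical and not part of dd-genomics.

```python
from typing import NamedTuple


class PositiveControl(NamedTuple):
    pmid: str
    gene: str
    variant: str
    disease: str


# The five confirmed variants listed above.
CONTROLS = [
    PositiveControl("25724810", "KMT2A", "c.3464G>A", "Wiedemann-Steiner syndrome"),
    PositiveControl("24686847", "IFIH1", "c.2159G>A", "Aicardi-Goutieres syndrome 7"),
    PositiveControl("25394726", "NOTCH3", "c.6692_93insC", "lateral meningocele syndrome"),
    PositiveControl("25519458", "FLCN", "c.57_58del", "Birt-Hogg-Dube syndrome"),
    PositiveControl("23951358", "GJA1", "c.716G>A", "craniometaphyseal dysplasia"),
]


def missing_controls(controls, extracted_pairs):
    """Return the controls whose (pmid, gene) pair is absent from the output.

    `extracted_pairs` is an assumed format: an iterable of (pmid, gene) tuples.
    """
    found = set(extracted_pairs)
    return [c for c in controls if (c.pmid, c.gene) not in found]
```

Calling `missing_controls(CONTROLS, pairs)` on the pairs from a run would list the controls that never made it to the end of the pipeline.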

Colossus commented 9 years ago

@ajratner Taking a hard look at these positive controls is somewhat of a dampener on high spirits. In fact, after looking at those, I think we should work even more with positive controls. From these papers we wouldn't have picked up any meaningful information for a couple of reasons:

We're not doing multi-sentence genepheno linking, and as it turns out, genepheno links occur far more often across whole paragraphs than within single sentences.
Most diseases and symptoms are referred to by their full name only once, and after that by an abbreviation. I've been working on gene abbreviations for a long time, and the last hurdle is now inference speed, which I'm also already working on.
We're not picking up full disease names, but only what is in HPO, which includes a ragbag collection of some well-known diseases (breast cancer, diabetes) but omits tons of other interesting diseases. While this was a strategic decision, I think we should move beyond HPO eventually.
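
To illustrate the abbreviation point, here is a rough sketch of picking up "long form (SHORT)" definitions so that later mentions of the short form can be mapped back. It is a deliberately naive initial-letter heuristic, not the approach used in the pipeline; `find_abbreviations` is a hypothetical name, and the rule misses cases such as "craniometaphyseal dysplasia (CMD)" where letters come from inside a word.

```python
import re

# A parenthesized candidate short form such as "(AGS)".
_SHORT_FORM = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")
_WORD = re.compile(r"[A-Za-z]+")


def find_abbreviations(text):
    """Map each short form to the preceding long form when the initials match."""
    pairs = {}
    for match in _SHORT_FORM.finditer(text):
        short = match.group(1)
        # Take as many preceding words as the short form has characters.
        words = list(_WORD.finditer(text[:match.start()]))
        candidate = words[-len(short):]
        if len(candidate) != len(short):
            continue
        initials = "".join(w.group(0)[0] for w in candidate)
        if initials.lower() == short.lower():
            # Slice the original text so hyphens between words are preserved.
            pairs[short] = text[candidate[0].start():candidate[-1].end()]
    return pairs
```

For example, on "Patients with Aicardi-Goutieres syndrome (AGS) were studied." this would map `AGS` to `Aicardi-Goutieres syndrome`.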

Colossus commented 9 years ago

@ajratner This also has certain implications for our holdout set. While I think we should definitely move forward with the holdout set as currently planned, in a few months we might have to create a much more general holdout set that includes whole paragraphs, disease names and pheno abbreviations in particular.

One more thing is that symptom names are actually not 100% standardized. While the HPO ontology does contain a large number of synonyms for symptoms, I'm pretty sure we miss 10-20% (much more?? much less??) of symptom descriptions, and it's actually pretty hard to find out how much we miss. But this problem has to be of lower priority now, as there are areas where much more progress could be made first.

ajratner commented 9 years ago

@Colossus quick thoughts in response, more when we next chat and/or later:

I personally still think that we should get the first holdout set made already, see definitive progress on this task, get it working, and then tackle harder / different challenges, like multi-sentence.
On multi-sentence, though, while we're on the topic: one related question is how we compare with corpus statistics-based methods (e.g. methods that operate on counts across a whole paragraph / doc). Can we learn something / borrow something from these? But again, I think taking one step at a time is very important for such a big project.
(a) Moving beyond HPO and (b) improving recall for non-standard pheno phrasings more generally are things we've talked about and definitely should be on the list; again, it's just a matter of prioritization.
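
As a rough picture of what a count-based baseline at the paragraph level could look like, here is a minimal sketch. It assumes plain substring matching against given gene and phenotype lists, which is far cruder than real entity extraction; `cooccurrence_counts` is a hypothetical name.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(paragraphs, genes, phenotypes):
    """Count paragraphs in which a gene and a phenotype are both mentioned.

    Matching is case-insensitive substring search (an assumption made only
    to keep the sketch short).
    """
    counts = Counter()
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        genes_here = [g for g in genes if g.lower() in lowered]
        phenos_here = [p for p in phenotypes if p.lower() in lowered]
        # Each (gene, phenotype) pair is counted at most once per paragraph.
        counts.update(product(genes_here, phenos_here))
    return counts
```

Pairs that only co-occur across sentence boundaries still get counted here, which is exactly the case the single-sentence extractor cannot see.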

I still really do think, though, that until we are holding hard numbers that show we've conquered our current task (i.e. single-sentence, semi-standard phrasings in HPO), we should just keep on this track.

What is the progress on starting annotation of the holdout set? @gbgbg any thoughts on the labeling guidelines? Johannes, any new thoughts?

Colossus commented 9 years ago

@ajratner I think Gill has no further objections to the holdout set, but hasn't commented so far. I'm just rerunning the full pipeline with noncanonical gene names on genomics_production, then I'll create a new holdout set, and then we can start labeling.

ajratner commented 9 years ago

Ok cool!


Colossus commented 9 years ago

@ajratner BTW if you have something there could you perhaps clean your deepdive/out directory? The last genomics_production run crashed because we ran out of disk space on /lfs (again ... :( )

ajratner commented 9 years ago

@Colossus I just cleared out 406GB

Colossus commented 9 years ago

thanks!

HazyResearch / dd-genomics

Manually add a handful of articles that solve patients we have already solved #77