Closed. gbgbg closed this issue 8 years ago.
Here are five examples of variants that we confirmed in published literature:
@ajratner Taking a hard look at these positive controls is something of a dampener on high spirits. In fact, after looking at them, I think we should work with positive controls even more. We wouldn't have picked up any meaningful information from these papers, for a couple of reasons:
@ajratner This also has certain implications for our holdout set. While I think we should definitely move forward with the holdout set as currently planned, in a few months we might have to create a much more general holdout set that includes whole paragraphs, disease names and pheno abbreviations in particular.
One more thing: symptom names are not 100% standardized. While HPO does contain a large number of synonyms for symptoms, I'm pretty sure we miss 10-20% (much more?? much less??) of symptom descriptions, and it's actually pretty hard to find out how much we miss. But this problem has to be lower priority for now, as there are areas where much more progress could be made first.
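One way to put a rough number on the miss rate would be to hand-label a small sample of symptom mentions and check them against the synonym list. The following is only a sketch under assumptions: `synonym_miss_rate` is a hypothetical helper, the example strings are made up rather than taken from HPO, and exact case-insensitive matching is a simplification of whatever matching the pipeline really does.

```python
def synonym_miss_rate(labeled_mentions, synonyms):
    """Return (fraction missed, list of missed mentions).

    labeled_mentions: symptom strings a human annotator marked in text.
    synonyms: symptom names and synonyms from the ontology (assumed input).
    Matching here is exact and case-insensitive, which is a simplification.
    """
    if not labeled_mentions:
        raise ValueError("need at least one labeled mention")
    known = {s.lower() for s in synonyms}
    missed = [m for m in labeled_mentions if m.lower() not in known]
    return len(missed) / len(labeled_mentions), missed


# Made-up illustration data, not real ontology content.
rate, missed = synonym_miss_rate(
    ["hypotonia", "floppy baby", "seizures"],
    ["Hypotonia", "Seizures"],
)
print(f"missed {rate:.0%} of labeled mentions: {missed}")
```

With a sample of a few hundred labeled mentions this would at least tell us whether 10-20% is the right ballpark.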
@Colossus quick thoughts in response, more when we next chat and/or later:
I still really do think, though, that until we have hard numbers showing we've conquered our current task (i.e. single-sentence, semi-standard phrasings in HPO), we should just keep on this track.
What is the progress on starting annotation of the holdout set? @gbgbg any thoughts on the labeling guidelines? Johannes, any new thoughts?
@ajratner I think Gill has no further objections to the holdout set, but he hasn't commented so far. I'm just rerunning the full pipeline with noncanonical gene names on genomics_production; then I'll create a new holdout set, and then we can start labeling.
Ok cool!
@ajratner BTW, if you have something there, could you perhaps clean your deepdive/out directory? The last genomics_production run crashed because we ran out of disk space on /lfs (again ... :( )
@Colossus I just cleared out 406GB
thanks!
A+H have a set of papers that conclusively solve some of our patients. Pick a subset of these (five?) and see whether we have already crawled them. If not, put them into a crawlable format. Then crawl them and run them through the DDG process to check that they yield the expected jackpots all the way at the end of the A+H process.
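The check described above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the function names, the paper IDs, and the idea of representing results as (gene, phenotype) pairs per paper are all assumptions made for the example.

```python
def split_by_crawl_status(control_ids, crawled_ids):
    """Split positive-control paper IDs into (already crawled, still missing)."""
    crawled = set(crawled_ids)
    already = [pid for pid in control_ids if pid in crawled]
    missing = [pid for pid in control_ids if pid not in crawled]
    return already, missing


def missed_controls(expected, extracted):
    """Return the expected pairs per paper that the pipeline did not produce.

    expected / extracted: dict mapping paper ID -> set of (gene, phenotype)
    pairs. Both structures are hypothetical stand-ins for the real output.
    """
    result = {}
    for pid, pairs in expected.items():
        gap = pairs - extracted.get(pid, set())
        if gap:
            result[pid] = gap
    return result


# Hypothetical IDs and pairs, for illustration only.
already, missing = split_by_crawl_status(["p1", "p2", "p3"], ["p2", "p9"])
print("already crawled:", already)
print("need crawlable format:", missing)

gaps = missed_controls(
    {"p2": {("GENE_A", "pheno_x"), ("GENE_B", "pheno_y")}},
    {"p2": {("GENE_A", "pheno_x")}},
)
print("expected but not extracted:", gaps)
```

An empty result from `missed_controls` would mean every positive control made it through to the end; anything left in it is a jackpot we failed to hit.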