HazyResearch / dd-genomics

The Genomics DeepDive project
Apache License 2.0

We are learning a lot of cancer related facts - do we want them? #222

Closed: gbgbg closed this issue 9 years ago

gbgbg commented 9 years ago

An extreme would be to prune HPO at the subtree(s) that describe cancer. Not sure.
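A minimal sketch of what pruning at a subtree could look like, assuming the ontology is available as a parent-to-children mapping. The term IDs, `CHILDREN`, `descendants`, and `prune_subtree` are hypothetical names for illustration; none of them come from HPO or from this repository.

```python
from collections import deque

# Toy ontology: parent term -> child terms. IDs are made up for illustration.
CHILDREN = {
    "T:root": ["T:neoplasm", "T:growth"],
    "T:neoplasm": ["T:carcinoma", "T:sarcoma"],
    "T:carcinoma": ["T:ductal_carcinoma"],
    "T:growth": ["T:short_stature"],
}


def descendants(children, root):
    """Return root plus every term reachable from it (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def prune_subtree(children, root):
    """Return a copy of the ontology without root and its descendants."""
    drop = descendants(children, root)
    return {
        parent: [c for c in kids if c not in drop]
        for parent, kids in children.items()
        if parent not in drop
    }


pruned = prune_subtree(CHILDREN, "T:neoplasm")
print(pruned)
```

One caveat with this approach: HPO is a directed acyclic graph, so a term can have more than one parent. The sketch drops every term reachable from the chosen root, including terms that also hang under a non-cancer parent, which may or may not be what is wanted.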

ajratner commented 9 years ago

We've already put in some distant supervision rules that should probably bias us away from some cancer facts (e.g. negative for "cell lines"). However, that was always on the grounds that it was not a GP relation...
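For illustration only, a rule of the kind described above might look roughly like this. The pattern, the function name `supervise`, and the label convention (-1 for negative, 0 for "rule did not fire") are assumptions, not the project's actual rules.

```python
import re

# Hypothetical pattern: phrases that suggest a cell-line experiment.
CELL_LINE = re.compile(r"\bcell[\s-]+lines?\b", re.IGNORECASE)


def supervise(sentence):
    """Return -1 (negative label) if the rule fires, otherwise 0."""
    if CELL_LINE.search(sentence):
        return -1
    return 0


print(supervise("PI3K is upregulated in HeLa cell lines."))
print(supervise("BRCA1 mutations cause breast cancer."))
```

A rule like this only catches cancer sentences as a side effect, which matches the point above: it fires because the sentence is not a GP relation, not because it is about cancer.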

We can also obviously add this in elsewhere, e.g. to the API or some post-processing stage?

On Fri, Oct 16, 2015 at 7:13 PM gbgbg wrote:

> An extreme would be to prune HPO at the subtree(s) that describe cancer. Not sure.


Colossus commented 9 years ago

This whole cancer thing is ill-suited to our extraction method. It appears that a HUGE body of cancer research is about finding proteins that are up- or downregulated in certain types of cancer. When labeling, tons of examples come up that list tumor markers in specific cancers, and I'm not sure how to label those. If PI3K is a tumor marker for Wilms' tumor, is it associated with Wilms' tumor? I'd rather say yes, but it's not exactly what we're looking for ...

Even worse, there are tons of sentences that say: "Our findings suggest that X might possibly be used as a tumor marker for cancer Y." Who pays these people?????

More issues there:

My opinion: Let's get rid of cancer. The literature is a juggernaut, 95% is not of interest to us, and we don't know how to treat it.

Colossus commented 9 years ago

If somebody's down, let them just write an extractor for cancer tumor markers ... It might be more useful than 1000 lab techs writing papers about "might possibly could be a tumor marker for ductal cell carcinoma"

Colossus commented 9 years ago

Counting through a single page of examples from the holdout set, 20/50 sentences are about cancer.
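The manual count above (20/50, i.e. 40%) could be approximated on a larger sample with a crude keyword match. This is only a sketch: the keyword list, `CANCER`, `cancer_fraction`, and the sample sentences are all made up for illustration, and a keyword match will both miss and over-count sentences.

```python
import re

# Hypothetical keyword list; deliberately crude.
CANCER = re.compile(
    r"\b(cancer|tumou?r|carcinoma|neoplasm|sarcoma|leukemia|lymphoma)s?\b",
    re.IGNORECASE,
)


def cancer_fraction(sentences):
    """Return the share of sentences that mention a cancer keyword."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if CANCER.search(s))
    return hits / len(sentences)


# Made-up sample, not taken from the holdout set.
SAMPLE = [
    "X might be used as a tumor marker for ductal cell carcinoma.",
    "Mutations in gene A cause short stature.",
    "Gene B is upregulated in breast cancer.",
    "Loss of gene C leads to hearing impairment.",
    "Gene D variants are associated with seizures.",
]

print(cancer_fraction(SAMPLE))
```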

One fascinating fact, in any case, is that the cancer literature apparently repeats itself more often than any other type. It's almost funny that in a single random subset of sentences from PubMed we'd find the fact that BRCA1 causes breast cancer at least a dozen times.