HazyResearch / dd-genomics

The Genomics DeepDive project
Apache License 2.0

With noncanonical genes, gene supervision completely thrown off by random English words #175

Closed Colossus closed 9 years ago

Colossus commented 9 years ago

A few random lines from the genes table are FULL of "for", "was" etc.

I think the problem can be solved very simply by now distinguishing between all-uppercase and mixed-case matching when discovering genes. I think this is better in general. For canonical gene names, we didn't have English clashes regardless of case. Since it turns out we do have clashes for noncanonical gene names (REFSEQ IDs hardly occur at all, apparently), we'll require case sensitivity there and see if this fixes things.
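A minimal sketch of the matching rule proposed above, assuming the gene lists are plain sets of symbols. The set contents and the function name `is_gene_mention` are illustrative placeholders, not the project's actual tables or code.

```python
# Illustrative placeholder data, not the real dd-genomics gene tables.
CANONICAL = {"SHH", "BRCA1"}
NONCANONICAL = {"FOR", "HHG1"}


def is_gene_mention(token: str) -> bool:
    """Canonical symbols match in any case; noncanonical ones need exact case."""
    if token.upper() in CANONICAL:
        return True
    # Case-sensitive lookup: "for" no longer matches the noncanonical name "FOR".
    return token in NONCANONICAL


tokens = "Screening for HHG1 and shh was done".split()
matches = [t for t in tokens if is_gene_mention(t)]
print(matches)
```

With this rule the English word "for" is rejected, while an all-uppercase "FOR" would still be accepted as a candidate mention.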

amwenger commented 9 years ago

In the future, you might also consider keeping italicized words with an initial capital to be more sensitive to mouse gene name mentions. By convention:

| Species | Gene | Protein |
|---|---|---|
| Human | uppercase italics (e.g. SHH) | uppercase (e.g. SHH) |
| Mouse | initial capital italics (e.g. Shh) | uppercase (e.g. SHH) |
| Zebrafish | lowercase italics (e.g. shh) | initial capital (e.g. Shh) |

chrismre commented 9 years ago

@zhangce @ajratner We need to retain font faces... Do we have this info reliably?

Colossus commented 9 years ago

@chrismre I think currently both boldface and italics are marked by underscores left and right of the word.

@amwenger There are two issues with this: 1) not everyone italicizes gene names 2) this has been in the back of @ajratner 's and my heads for a long time, but we didn't get around to it ...
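For illustration, a rough sketch of how underscore-marked words could be pulled out and grouped by capitalization pattern. It assumes emphasis arrives as `_word_` (so italics and boldface cannot be told apart), and the names `EMPHASIZED`, `case_pattern` and `emphasized_words` are made up for this example.

```python
import re

# Assumption: an emphasized word is wrapped in single underscores, e.g. _Shh_.
# The lookarounds skip underscores inside identifiers such as snake_case_name.
EMPHASIZED = re.compile(r"(?<!\w)_([A-Za-z][A-Za-z0-9-]*)_(?!\w)")


def case_pattern(word: str) -> str:
    """Name the capitalization pattern of a word."""
    if word.isupper():
        return "uppercase"
    if word.islower():
        return "lowercase"
    if word[0].isupper() and word[1:].islower():
        return "initial capital"
    return "mixed"


def emphasized_words(text: str):
    """Return (word, case pattern) pairs for every underscore-marked word."""
    return [(m.group(1), case_pattern(m.group(1))) for m in EMPHASIZED.finditer(text)]


sample = "Loss of _Shh_ in mouse and _SHH_ in human; _shh_ in zebrafish."
print(emphasized_words(sample))
```

Since not everyone italicizes gene names, a pattern like this could only ever be an extra signal, not a requirement for a match.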

ajratner commented 9 years ago

This is currently picked up by the XML parser and converted to Markdown style. It seems to be present for PMC-OA but not for PubMed abstracts.

zhangce commented 9 years ago

seems to be present for PMC-OA but not Pubmed abstracts

If we really need this, we should be able to get it from the HTML version of the PubMed abstract dump.

amwenger commented 9 years ago

I agree that sticking to uppercase is good for now. Lucky for us, the human gene and protein conventions are both uppercase.

gbgbg commented 9 years ago

Sorry to be so daft. Can you remind me why this is an issue at all? The (human) gene world is a closed world. Only names and acronyms provided by us can be real human gene names (some mentions of them will be genuine, some won't). But a mention of a word not in the finite gene list (like "was" or "for") can never be true. Are we trying to discover gene names or synonyms unknown to us? If not, but DD cannot be stopped from looking for new names or acronyms, isn't there a place where we can curtail these efforts? There must be something fundamental I'm missing or forgetting.

Colossus commented 9 years ago

The noncanonical gene name list simply contains the words "for" and "was" and perhaps a few other common words.

(I just discovered WAS is actually a canonical symbol ... however, it didn't cause problems so far. Maybe "was" is actually not that common a word in bio literature, but "for" definitely is.)
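One way to spot such entries up front would be to check the name list against a list of common English words. The sketch below assumes a tiny hand-written stopword set and a hypothetical helper `english_clashes`; a real check would need a fuller word list.

```python
# Hypothetical stopword set for illustration; not an exhaustive English word list.
ENGLISH_STOPWORDS = {"for", "was", "and", "the", "in", "of"}


def english_clashes(names):
    """Return the gene names that collide with a common English word, ignoring case."""
    return sorted(n for n in names if n.lower() in ENGLISH_STOPWORDS)


# Illustrative placeholder list, not the real noncanonical gene name table.
noncanonical_names = {"FOR", "WAS", "HHG1"}
clashes = english_clashes(noncanonical_names)
print(clashes)
```

The flagged names would still need a manual look, since a clash such as WAS can also be a legitimate symbol.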