clulab / reach

Reach Biomedical Information Extraction

Classify Smad 1/3/5 as three different proteins #136

Open MihaiSurdeanu opened 8 years ago

herongrove commented 8 years ago

This is proving to be a chicken-and-egg problem. I would like to use the BioNLPTokenizerPostProcessor to split Smad1/3/5 into Smad 1, Smad 3, and Smad 5. But quite a few proteins have slashes in their UniProt-provided synonyms, e.g. A2/6 and E4-ORF6/7. We clearly wouldn't want to produce A2 and A6, or E4-ORF6 and E4-ORF7, in these cases. We want to split the token when appropriate so that we can look up the entities, but we don't know whether to split until we've looked up the entities.

An option would be to hardcode a regex that looks for protein classes/families that we know are followed by slash-separated numbers, e.g. /(?i)\b(smad|kras|braf) ?[0-9]+(\/[0-9]+)+/. The problem here is that there are many, many proteins whose names end in a digit, far more than there are proteins with a genuine slash in a single name. We could focus on "important" proteins, e.g. ones we have seen often in our papers of interest, and stack, say, 100 of them in our regex, but that's not a very satisfying solution.
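For concreteness, here is a minimal Python sketch of the whitelist idea. It is not Reach code (Reach is written in Scala). The names `FAMILIES`, `PATTERN`, and `expand_slashed` are made up for illustration, and the three families are just the ones from the example regex.

```python
import re
from typing import List

# Hypothetical whitelist; in practice this might hold ~100 "important" families.
FAMILIES = ("smad", "kras", "braf")

# A whitelisted family name, an optional space, then two or more
# slash-separated numbers (e.g. "Smad1/3/5" or "Smad 1/3/5").
PATTERN = re.compile(
    r"(?P<family>" + "|".join(FAMILIES) + r")\s?(?P<nums>[0-9]+(?:/[0-9]+)+)",
    re.IGNORECASE,
)

def expand_slashed(token: str) -> List[str]:
    """Return one name per number for whitelisted families,
    or the token unchanged if it does not match."""
    m = PATTERN.fullmatch(token)
    if m is None:
        return [token]
    return [m.group("family") + n for n in m.group("nums").split("/")]
```

With this sketch, `expand_slashed("Smad1/3/5")` yields `["Smad1", "Smad3", "Smad5"]`, while `A2/6` and `E4-ORF6/7` come back untouched because neither prefix is on the whitelist. The weakness is the one already noted: the whitelist has to be curated by hand.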

Ideas, @myedibleenso or @marcovzla?

MihaiSurdeanu commented 8 years ago

Is this a question for Emek, or another biologist friend?

hickst commented 8 years ago

It sounds like the necessary information is not available at this point (BioNLPTokenizerPostProcessor). Architecturally, you need to either get the information to this point or postpone the decision until the information is available. If you could implement the split as a rule and postpone the decision until after NER has run, the mention would already be labeled as a protein or not.
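A rough Python sketch of the "postpone the decision" idea follows, under some loud assumptions: the predicate `is_known_protein` stands in for whatever the NER/KB lookup would report, and `split_after_ner` is a hypothetical helper, not part of Reach or processors. The token is only split when it is not itself a known name and every expanded candidate is.

```python
from typing import Callable, List

def split_after_ner(token: str,
                    is_known_protein: Callable[[str], bool]) -> List[str]:
    """Decide whether to split a slashed token once entity labels are available."""
    # A token that is itself a known name (e.g. "A2/6") is never split.
    if "/" not in token or is_known_protein(token):
        return [token]
    head, *rest = token.split("/")
    # Separate the name stem from its trailing digits: "Smad1" -> "Smad".
    stem = head.rstrip("0123456789")
    if not stem or stem == head or not all(r.isdigit() for r in rest):
        return [token]
    candidates = [head] + [stem + r for r in rest]
    # Split only if every expanded name is recognized; otherwise keep the token.
    return candidates if all(is_known_protein(c) for c in candidates) else [token]
```

For example, with a toy knowledge base `{"A2/6", "E4-ORF6/7", "Smad1", "Smad3", "Smad5"}` and `is_known_protein = kb.__contains__`, the token `Smad1/3/5` is expanded into three names, while `A2/6` and `E4-ORF6/7` are kept whole.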

myedibleenso commented 8 years ago

I don't think there is a good/simple solution here. After the NER layer, you'd have to decide whether to alter the tokenization. Altering the tokenization would mean that you would need to (minimally) generate a new Sentence (parse, tags, etc.), update the Document to use that Sentence (I think there is a Document.text attribute?), and restart the system on the updated Document.
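To make the cost concrete, here is a toy Python sketch of what "generate a new Sentence" involves. `ToySentence` and `retokenize` are hypothetical stand-ins, not the real Sentence class. The sketch assumes each new token inherits the character span of the token it replaces, and it drops the tags to signal that downstream annotation has to be rerun.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass(frozen=True)
class ToySentence:
    """Hypothetical, simplified stand-in for a sentence annotation."""
    words: Tuple[str, ...]
    spans: Tuple[Tuple[int, int], ...]       # (start, end) character offsets
    tags: Optional[Tuple[str, ...]] = None   # None means "not yet annotated"

def retokenize(sent: ToySentence, index: int,
               new_words: Sequence[str]) -> ToySentence:
    """Return a NEW sentence with the token at `index` replaced by `new_words`."""
    words = sent.words[:index] + tuple(new_words) + sent.words[index + 1:]
    # Assumption: every new token points back at the original token's span.
    spans = (sent.spans[:index]
             + (sent.spans[index],) * len(new_words)
             + sent.spans[index + 1:])
    # The old tags no longer line up with the words, so they are dropped.
    return ToySentence(words=words, spans=spans, tags=None)
```

The original sentence is left untouched; the caller would then swap the new sentence into the document and rerun tagging, parsing, and NER on it.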

Marco and I talked a few times about an Odin matcher for specifying rewrite rules on Sentence fields (words, lemmas, tags, etc.). Instead of returning mentions, these matchers would return a new Sentence. If we were to do it, though, it would be sometime down the line. In any case, I think the problem you've identified is a good use case to demonstrate the value of such a matcher.