MihaiSurdeanu opened this issue 8 years ago
Is this a question for Emek, or another biologist friend?
It sounds like the necessary information is not available at this point (`BioNLPTokenizerPostProcessor`). Architecturally, you either need to get the information to this point or postpone the decision until the information is available. If you could implement the split as a rule and postpone the decision until after NER has run, the mention would already be labeled as a protein (or not).
I don't think there is a good/simple solution here. After the NER layer, you'd have to decide whether to alter the tokenization. Altering the tokenization would mean that you would need to (minimally) generate a new `Sentence` (parse, tags, etc.), update the `Document` to use that `Sentence` (I think there is a `Document.text` attribute?), and restart the system.
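To make that more concrete, here's a rough sketch of what a post-NER re-tokenization pass could look like. It assumes the current `org.clulab.processors` API (`Sentence.entities`, `Processor.annotate(text, keepText)`); the `splitSlashToken` helper, the "only split tokens NER left unlabeled" heuristic, and the re-join-by-spaces shortcut are simplifications of mine, not anything that exists in the codebase:

```scala
import org.clulab.processors.{Document, Processor}

// Hypothetical post-NER pass: if NER did NOT label a slashed token
// (e.g. "Smad1/3/5") as a single entity, split it and re-run the pipeline.
object PostNerRetokenizer {
  private val Slashed = """([A-Za-z]+)([0-9]+(?:/[0-9]+)+)""".r

  // e.g. "Smad1/3/5" -> Seq("Smad", "1", "and", "Smad", "3", "and", "Smad", "5")
  def splitSlashToken(token: String): Seq[String] = token match {
    case Slashed(prefix, nums) =>
      nums.split("/").toSeq.flatMap(n => Seq(prefix, n, "and")).dropRight(1)
    case _ => Seq(token)
  }

  def retokenize(proc: Processor, doc: Document): Document = {
    val rewritten = doc.sentences.map { sent =>
      val entities = sent.entities.getOrElse(Array.fill(sent.words.length)("O"))
      sent.words.zip(entities).flatMap { case (w, ner) =>
        // only split tokens that NER left unlabeled ("O")
        if (ner == "O") splitSlashToken(w) else Seq(w)
      }.mkString(" ")
    }.mkString(" ")
    // "restart the system": re-annotate the rewritten text from scratch.
    // Note this naive space-join throws away the original offsets/spacing.
    proc.annotate(rewritten, keepText = true)
  }
}
```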
Marco and I talked a few times about an Odin matcher for the specification of rewrite rules on `Sentence` fields (`words`, `lemmas`, `tags`, etc.). Instead of returning mentions, these matchers would return a new `Sentence`. If we were to do it, though, it would be down the line sometime. In any case, I think the problem you've identified is a good use case to demonstrate the value of such a matcher.
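Just to sketch the idea (nothing below exists in Odin today; the names are made up for illustration), the interface could be as simple as rules that map a `Sentence` to a rewritten `Sentence`:

```scala
import org.clulab.processors.Sentence

// Hypothetical "rewrite matcher": a rule that transforms a Sentence
// instead of producing Mentions.
trait SentenceRewriter {
  /** Returns a rewritten copy of the sentence, or None if the rule doesn't fire. */
  def rewrite(sent: Sentence): Option[Sentence]
}

// Apply a sequence of rewrite rules before the rest of the pipeline runs.
class PipelineWithRewrites(rewriters: Seq[SentenceRewriter]) {
  def apply(sent: Sentence): Sentence =
    rewriters.foldLeft(sent) { (s, r) => r.rewrite(s).getOrElse(s) }
}
```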
This is proving to be a chicken-and-egg problem. I would like to use the `BioNLPTokenizerPostProcessor` to split `Smad1/3/5` into `Smad 1 and Smad 3 and Smad 5`. But there are quite a few proteins that have slashes in their UniProt-provided synonyms, e.g. `A2/6` and `E4-ORF6/7`. We clearly wouldn't want to end up with `A2 and A6` or `E4-ORF6 and E4-ORF7` in those cases. So we want to split the token when appropriate so we can look up the entities, but we don't know whether to split until we've looked up the entities.

One option would be to hardcode a regex that looks for protein classes/families that we know precede numbers, e.g. `/(?i)\b(smad|kras|braf) [0-9]+\/[0-9]+/`. The problem is that there are many, many proteins that end in a digit, far more than there are proteins with real slashes in a single name. We could focus on "important" proteins, e.g. ones we have seen often in our papers of interest, and have say 100 of them stacked in the regex, but that's not a very satisfying solution.

Ideas, @myedibleenso or @marcovzla?