The NER is sometimes gobbling up the words "mutant" and "mutation" when tagging an NE sequence.
Here is the problem:
NER sees "mutant AKT" and tags those tokens as B-Gene_or_gene_product and I-Gene_or_gene_product.
Our NER rule then produces a single Mention for "mutant AKT" with the label "Gene_or_gene_product".
Misery...
Examples
mutant AKT
AKT mutant
AKT mutation
mutant K-Ras
This is a serious problem for at least two reasons:
We fail to detect the mutation, since the mutation rules look only at the context surrounding a Mention with the label BioChemicalEntity.
We fail to ground this entity.
Other terms?
Perhaps there are other terms we should avoid scarfing down...
Proposed solutions
We could handle this in several places.
1. Adjust the NER component somehow.
2. Have a dirty secret in the mkNERMentions action. We could trim the problem token off from either end of the Mention and produce a new Mention with the same label.
3. Modify the NER rules.
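To make the trimming idea concrete, here is a rough Python sketch of the logic only. It is not the actual mkNERMentions code: `trim_span` and `PROBLEM_LEMMAS` are hypothetical names, and a Mention is reduced to a half-open token span over a list of lemmas.

```python
from typing import List, Optional, Tuple

# Hypothetical stop list; only "mutant" and "mutation" are known offenders so far.
PROBLEM_LEMMAS = {"mutant", "mutation"}


def trim_span(lemmas: List[str], start: int, end: int) -> Optional[Tuple[int, int]]:
    """Shrink the half-open token span [start, end) until neither edge
    is a problem lemma. Returns None if no tokens are left."""
    # Drop problem tokens from the left edge.
    while start < end and lemmas[start] in PROBLEM_LEMMAS:
        start += 1
    # Drop problem tokens from the right edge.
    while end > start and lemmas[end - 1] in PROBLEM_LEMMAS:
        end -= 1
    return (start, end) if start < end else None


# "mutant AKT" keeps only "AKT"; "AKT mutation" keeps only "AKT".
print(trim_span(["mutant", "AKT"], 0, 2))
print(trim_span(["AKT", "mutation"], 0, 2))
```

The new Mention would be built from the trimmed span with the original label. A span made up only of problem tokens yields nothing, so no Mention would be produced for it.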
I like (3) the most, I think. Here's what I think it would look like (untested at the time of this writing):
[entity="B-somelabel" & !lemma=/^(mutant|mutation)$/]
[entity="I-somelabel" & !lemma=/^(mutant|mutation)$/]*
|
# an NE sequence should not begin with "mutant", "mutation", etc
(?<= [entity="B-somelabel" & lemma=/^(mutant|mutation)$/])
# one or more of these
[entity="I-somelabel" & !lemma=/^(mutant|mutation)$/]+
The second pattern asserts that the token preceding the match must be the beginning of an NE sequence and must have the lemma form "mutant" or "mutation". The match itself is composed of at least one token that is included in (but not the beginning of) an NE sequence and does not have the lemma form "mutant" or "mutation".
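Since the rule is untested, here is a small Python simulation of what the two alternatives are meant to match. It is only a sketch of the intended behavior, not Odin itself: `ne_spans` is a hypothetical helper, tokens are assumed to be (lemma, entity) pairs, and lemmas are compared exactly, as in the regex.

```python
from typing import List, Tuple

PROBLEM_LEMMAS = {"mutant", "mutation"}


def ne_spans(tokens: List[Tuple[str, str]], label: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans the two alternatives would match."""
    begin, inside = f"B-{label}", f"I-{label}"

    def is_problem(k: int) -> bool:
        return tokens[k][0] in PROBLEM_LEMMAS

    spans = []
    for i, (_, entity) in enumerate(tokens):
        if entity != begin:
            continue
        if is_problem(i):
            # Alternative 2: the B- token is only a lookbehind, so the
            # match starts on the next token and is still empty.
            start, end = i + 1, i + 1
        else:
            # Alternative 1: the match starts on the B- token itself.
            start, end = i, i + 1
        # Extend over I- tokens that are not problem lemmas.
        while end < len(tokens) and tokens[end][1] == inside and not is_problem(end):
            end += 1
        if end > start:
            spans.append((start, end))
    return spans


label = "Gene_or_gene_product"
mutant_akt = [("mutant", f"B-{label}"), ("AKT", f"I-{label}")]
akt_mutant = [("AKT", f"B-{label}"), ("mutant", f"I-{label}")]
print(ne_spans(mutant_akt, label))  # span covering only "AKT"
print(ne_spans(akt_mutant, label))  # span covering only "AKT"
```

One thing the simulation makes visible: a problem token in the middle of a sequence (say, B-"K-Ras", I-"mutant", I-"protein") cuts the match off after "K-Ras", and the trailing "protein" is not picked up by either alternative. That may or may not be what we want.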
Thoughts?
I think we should take care of this before the evaluation (I actually found this while testing stuff for assembly).