clulab / reach

Reach Biomedical Information Extraction

The NER is overeating #97

Open myedibleenso opened 8 years ago

myedibleenso commented 8 years ago

Problem

The NER is sometimes gobbling up the words "mutant" and "mutation" when tagging an NE sequence.

Here is the problem:

  1. NER sees "mutant AKT" and tags those tokens as B-Gene_or_gene_product and I-Gene_or_gene_product.
  2. Our NER rule produces a Mention for "mutant AKT" with the label "Gene_or_gene_product".
  3. Misery...

Examples

  • mutant AKT
  • AKT mutant
  • AKT mutation
  • mutant K-Ras

This is a serious problem for at least two reasons:

  1. We fail to detect the mutation, since the mutation rules look only at the context surrounding a Mention with the label BioChemicalEntity, and here the word "mutant" or "mutation" sits inside the Mention itself.
  2. We fail to ground this entity.

Other terms?

Perhaps there are other terms we should avoid scarfing down...

Proposed solutions

We could handle this in several places.

  1. Adjust the NER component somehow.
  2. Have a dirty secret in the mkNERMentions action. We could trim the problem token off from either end of the Mention and produce a new Mention with the same label.
  3. Modify the NER rules.
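To make option (2) concrete, here is a minimal Python sketch of the trimming idea. It is not the actual mkNERMentions code: `trim_span`, `STOP_LEMMAS`, and the representation of a Mention as a half-open token span over a list of lemmas are all assumptions made for illustration.

```python
# Hypothetical sketch of option (2): trim problem tokens off either end
# of a mention span. Names and data layout are illustrative assumptions,
# not the real mkNERMentions action.
STOP_LEMMAS = frozenset({"mutant", "mutation"})


def trim_span(lemmas, start, end, stop_lemmas=STOP_LEMMAS):
    """Shrink the half-open span [start, end) until neither edge token
    has a lemma in stop_lemmas. Returns None if nothing is left."""
    # Drop problem tokens from the left edge.
    while start < end and lemmas[start] in stop_lemmas:
        start += 1
    # Drop problem tokens from the right edge.
    while end > start and lemmas[end - 1] in stop_lemmas:
        end -= 1
    return (start, end) if start < end else None
```

With this, `trim_span(["mutant", "AKT"], 0, 2)` would keep only the "AKT" token, and a span consisting solely of "mutant" would be dropped. The new Mention would then be built from the trimmed span with the same label.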

I like (3) the most, I think. Here's what I think it would look like (untested at the time of this writing):

[entity="B-somelabel" & !lemma=/^(mutant|mutation)$/] 
[entity="I-somelabel" & !lemma=/^(mutant|mutation)$/]*
|
# an NE sequence should not begin with "mutant", "mutation", etc
(?<= [entity="B-somelabel" & lemma=/^(mutant|mutation)$/]) 
# one or more of these
[entity="I-somelabel" & !lemma=/^(mutant|mutation)$/]+

The second pattern asserts that the token preceding the match must be the beginning of an NE sequence and must have the lemma "mutant" or "mutation". The match itself is composed of at least one token that is inside (but not the beginning of) an NE sequence and does not have the lemma "mutant" or "mutation".
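For anyone who wants to sanity-check the intended behavior without running the rule, here is a rough Python emulation of the two branches over parallel lists of BIO tags and lemmas. It is only a sketch of what the pattern above is meant to match, not Odin itself; `extract_spans` and `STOP_LEMMAS` are hypothetical names.

```python
# Hypothetical emulation of the two-branch pattern over BIO tags.
# This mimics the intended matches; it is not the Odin rule engine.
STOP_LEMMAS = frozenset({"mutant", "mutation"})


def extract_spans(tags, lemmas, label, stop_lemmas=STOP_LEMMAS):
    """Return half-open (start, end) token spans for one NE label."""
    begin, inside = f"B-{label}", f"I-{label}"
    spans = []
    for i, tag in enumerate(tags):
        if tag != begin:
            continue
        # Branch 1: the B- token is fine, so the span starts on it.
        # Branch 2: the B- token is a problem lemma, so it only acts as
        # the lookbehind and the span starts on the next token.
        start = i + 1 if lemmas[i] in stop_lemmas else i
        end = i + 1
        # Consume following I- tokens until a problem lemma is reached.
        while (end < len(tags) and tags[end] == inside
               and lemmas[end] not in stop_lemmas):
            end += 1
        if end > start:
            spans.append((start, end))
    return spans
```

On the examples above, "mutant AKT" (B, I) would yield only the "AKT" token and "AKT mutant" (B, I) would also yield only "AKT". Note that, like the rule, this drops any I- tokens that follow a problem lemma in the middle of a sequence.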

Thoughts?

I think we should take care of this before the evaluation (I actually found this while testing stuff for assembly).

MihaiSurdeanu commented 8 years ago

Yes, I like #3 too. This is caused by the CRF's training data, so it could be fixed there as well. But this means retraining, etc. A pain.

MihaiSurdeanu commented 8 years ago

It's a summer thing.

cl4yton commented 7 years ago

Coming... summer 2017