clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0

NER has no way to manually specify a resolution for label ambiguities #60

Closed hickst closed 8 years ago

hickst commented 8 years ago

The NER assigns labels based on a fixed, ordered set of categories. If the same text string occurs in two or more input files of different categories, there is apparently no way to specify which label should win, because the category order cannot be changed.

For example, the MITRE model identifies 'p110' and 'p85' as protein families, but these strings are really nicknames and do not occur in the Pfam or InterPro protein family databases. Even if they were valid family names, they would still be ambiguous, because they are also protein names listed in the UniProt protein database. Since the NER always labels proteins first and offers no override, it will always label these strings as proteins, and subsequent processing will treat them as such.
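
To make the problem concrete, here is a minimal Python sketch of fixed-priority labeling. It is not the processors code: the category names, the KB contents, and the `label` function are all hypothetical, chosen only to reproduce the behavior described above.

```python
# Hypothetical category order and KB contents, for illustration only.
# The real NER loads its KBs from input files; this just mimics the lookup order.
CATEGORY_ORDER = ["Protein", "Family"]

KBS = {
    "Protein": {"p110", "p85", "akt1"},
    "Family": {"p110", "p85", "ras"},
}


def label(token, order=CATEGORY_ORDER, kbs=KBS):
    """Return the first category, in priority order, whose KB contains the token."""
    key = token.lower()
    for category in order:
        if key in kbs[category]:
            return category
    return None  # token is not in any KB


# 'p110' and 'p85' are in both KBs, but "Protein" always wins
# because it comes first in the fixed order.
labels = {t: label(t) for t in ["p110", "p85", "Ras"]}
```

With this ordering, the only way to get 'p110' labeled as a family is to remove it from the protein KB or reorder the categories globally, and neither of those is a per-string decision.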

MihaiSurdeanu commented 8 years ago

@hickst: this issue overlaps with the global NER that @marcovzla is implementing. Please add these two examples to the auxiliary KB as protein families. @marcovzla's algorithm will disambiguate between families and proteins (even if they are labeled first as proteins).
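
For comparison, a manual per-string resolution could look roughly like the Python sketch below. This is not @marcovzla's global NER algorithm, which is not described in this thread, and it is not the auxiliary KB mechanism either. The `OVERRIDES` table, the KB contents, and the function name are assumptions made up for illustration.

```python
# Hypothetical data, for illustration only.
CATEGORY_ORDER = ["Protein", "Family"]

KBS = {
    "Protein": {"p110", "p85", "akt1"},
    "Family": {"p110", "p85", "ras"},
}

# Assumed manual override table: string -> label that should win.
OVERRIDES = {"p110": "Family", "p85": "Family"}


def label_with_overrides(token, overrides=OVERRIDES, order=CATEGORY_ORDER, kbs=KBS):
    """Check the manual overrides first, then fall back to the fixed category order."""
    key = token.lower()
    if key in overrides:
        return overrides[key]
    for category in order:
        if key in kbs[category]:
            return category
    return None  # token is not in any KB
```

In this sketch an override applies everywhere the string occurs. A context-sensitive disambiguator, such as the global NER mentioned above, could instead choose different labels for different mentions of the same string.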