clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0

Need to override NER identifications #61

Closed hickst closed 8 years ago

hickst commented 8 years ago

We often know how a given entity should be labeled. Assignments from knowledge sources should be able to override the NER's default classifications.

For example: the CRF seems to be responsible for identifying 'H-RAS' and 'K-RAS' (but not 'HRAS' or 'KRAS') as protein families even though our knowledge sources list these exclusively as proteins.
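To make the request concrete, here is a minimal sketch of the desired behavior, not the actual processors implementation. The table contents, the function name `apply_overrides`, and the label strings are illustrative assumptions; the idea is only that a label from a knowledge source wins over whatever the model predicted for the same token.

```python
# Hypothetical override table: entity text -> label from a knowledge source.
# The label strings are illustrative, not taken from the real resources.
OVERRIDES = {
    "H-RAS": "Gene_or_gene_product",
    "K-RAS": "Gene_or_gene_product",
}

def apply_overrides(tokens, predicted, overrides=OVERRIDES):
    """Return one label per token, preferring the knowledge-source label
    whenever the token appears in `overrides`."""
    return [overrides.get(tok, label) for tok, label in zip(tokens, predicted)]

tokens = ["H-RAS", "binds", "RAF"]
model_labels = ["Family", "O", "Family"]  # pretend output of the CRF
print(apply_overrides(tokens, model_labels))
```

Tokens without an entry (here `RAF`) keep the model's label, so the override only affects entities the knowledge sources actually cover.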

hickst commented 8 years ago

Attached is a candidate sample file that could be used to implement an NER override and auxiliary grounding capability. It is based on the NMZ auxiliary spreadsheet of 1/11/16, merged and updated with the latest (6/9/16) NMZ spreadsheet model, after conversion of the ad hoc NMZ types to Reach types.

Note: the file extension should be .tsv, but GitHub doesn't support uploads with this extension.

Attachment: NMZ-NER-aux_160624.txt

hickst commented 8 years ago

See additional examples in issue #60.

MihaiSurdeanu commented 8 years ago

@hickst: I added override capability to the bio NER in processors in the branch "ner-override". I also added your NMZ-NER-aux_160624.tsv.gz to bioresources. Can you please do the following:

  1. We should merge the two files you created, NMZ-NER... and NMZ-merged, into a single one. There is no reason to have two; keeping both is a recipe for disaster when we update one but forget the other. I currently assume that column 0 holds the entity name (multiple words are separated by spaces) and column 3 holds the NE label. Columns should be separated by TAB. All of this is trivial to change. Let me know if you change the format.
  2. Find a better name. NMZ... is not descriptive. Also, the date in the name is not needed, given version control. I suggest something descriptive such as "NER-Grounding-Override.tsv".
  3. Add all the entities that we consistently label incorrectly according to MITRE. More on this soon.

Let me know when the first 2 are done, so we can release bioresources and processors.
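For reference, the format assumed in item 1 can be read with a few lines of code. This is a rough sketch in Python, not the code in the "ner-override" branch: the function name `load_overrides` and the sample rows are made up, and the contents of columns 1 and 2 are placeholders because the thread does not describe them.

```python
import io

ENTITY_COL = 0  # entity name; multi-word names are space-separated
LABEL_COL = 3   # named-entity label

def load_overrides(lines):
    """Map each entity (as a tuple of tokens) to its NE label."""
    overrides = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= LABEL_COL:
            continue  # blank or malformed row: not enough columns
        tokens = tuple(cols[ENTITY_COL].split())
        label = cols[LABEL_COL].strip()
        if tokens and label:
            overrides[tokens] = label
    return overrides

# Made-up rows; columns 1 and 2 are placeholders for whatever the file holds.
sample = io.StringIO(
    "H-RAS\tcol1\tcol2\tGene_or_gene_product\n"
    "Ras family\tcol1\tcol2\tFamily\n"
    "\n"
)
overrides = load_overrides(sample)
print(overrides)
```

Keying on a tuple of tokens keeps multi-word entities intact, which makes it straightforward to match them against a tokenized sentence later. If the column positions change, only the two constants at the top need updating.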