Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License

Solve nested entities problems by using SpanCategorizer #88

Open hungvo304ml opened 1 year ago

hungvo304ml commented 1 year ago

This PR uses doc.spans["sc"] (the default span group key of spaCy's SpanCategorizer) to handle overlapping tokens in nested NER. Unlike doc.ents, which raises an error when two entities share a token, doc.spans["sc"] can store every candidate span, including overlapping ones. Once all candidate spans are stored, overlapping spans are filtered out with spacy.util.filter_spans and the remaining spans are assigned to doc.ents. When spans overlap, filter_spans keeps the longest span; among spans of equal length, it keeps the one that starts first.
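
To make the filtering rule concrete, here is a minimal pure-Python sketch of the "longest span first, earliest start on ties" behaviour described above. It is not the code from this PR and not spaCy's implementation: the function name filter_overlapping, the (start, end, label) tuple layout, and the example labels are all assumptions made for illustration, standing in for spaCy Span objects passed to spacy.util.filter_spans.

```python
from typing import List, Tuple

# Hypothetical stand-in for a spaCy Span: (start token, end token exclusive, label).
SpanTuple = Tuple[int, int, str]


def filter_overlapping(spans: List[SpanTuple]) -> List[SpanTuple]:
    """Drop overlapping spans, preferring longer spans and, on ties, earlier ones."""
    # Sort by length (descending), then by start offset (ascending).
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[SpanTuple] = []
    seen_tokens = set()
    for start, end, label in ordered:
        tokens = range(start, end)
        # Keep a span only if none of its tokens belong to an already kept span.
        if not any(t in seen_tokens for t in tokens):
            kept.append((start, end, label))
            seen_tokens.update(tokens)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s[0])


# Candidate spans as they might sit in a span group that allows overlaps.
candidates = [(0, 3, "ORG"), (2, 3, "GPE"), (2, 5, "FAC"), (5, 6, "MISC")]
non_overlapping = filter_overlapping(candidates)
print(non_overlapping)  # [(0, 3, 'ORG'), (5, 6, 'MISC')]
```

Here (0, 3) and (2, 5) have the same length, so the one that starts first wins; (2, 3) is dropped because it is nested inside a kept span. The result has no shared tokens, which is the property doc.ents requires.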