allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.66k stars, 223 forks

Filtering CUI/TUI returned entities? #516

Open ddofer opened 1 month ago

ddofer commented 1 month ago

When doing NER/NEL to UMLS/CUI entities, is there any way to configure the nlp pipe to exclude candidates by a predefined filtering list of CUIs or TUIs? For example, to exclude any detected CUIs with TUI T079 (Temporal Concept)?

Currently I'm doing it by post-hoc filtering, which is inelegant and inefficient, and doesn't help remove noisy detections. That is, if the linker returns only the first detected entity from a text, then post-hoc filtering to remove its TUI means I miss the relevant entities.

Current code extract:

```python
from itertools import compress

nlp.add_pipe("scispacy_linker", config={
    "resolve_abbreviations": True,
    "linker_name": "umls",
    "max_entities_per_mention": 4,  # 5
    "threshold": 0.87,  # default is 0.8, paper mentions 0.99 as thresh
})

...

EXCLUDE_TUIS_LIST = ["T079", "T093"]  # List of UMLS semantic types (TUIs) to exclude.

novel_cols_candidates_names = []
no_entities_list = []

novel_candidate_cuis = []
novel_candidate_cuis_nomenclatures = []
TUIs_list = []

linker = nlp.get_pipe("scispacy_linker")  # fetch the linker once, outside the loop

for f in icu_feature_terms["name"]:
    print(f)
    doc = nlp(f)

    if len(doc.ents) > 0:
        for j, entity in enumerate(doc.ents):
            print(f"Entity #{j}:{entity}")

            # kb_ents holds (CUI, score) pairs; keep only the CUIs
            list_feature_cuis = [i[0] for i in entity._.kb_ents]
            # index 1 of a KB entity is its canonical name
            list_cuis_nomenclatures = [linker.kb.cui_to_entity[c][1] for c in list_feature_cuis]

            # TUI filter: index 3 is the list of types; only the first type is checked
            tui_filter_mask = [linker.kb.cui_to_entity[c][3][0] not in EXCLUDE_TUIS_LIST
                               for c in list_feature_cuis]
            list_feature_cuis = list(compress(list_feature_cuis, tui_filter_mask))
            list_cuis_nomenclatures = list(compress(list_cuis_nomenclatures, tui_filter_mask))

            # extend each output list once per entity so they stay aligned
            num_candidates = len(list_feature_cuis)
            novel_cols_candidates_names.extend([f] * num_candidates)
            novel_candidate_cuis.extend(list_feature_cuis)
            novel_candidate_cuis_nomenclatures.extend(list_cuis_nomenclatures)
            TUIs_list.extend(linker.kb.cui_to_entity[c][3][0] for c in list_feature_cuis)
    else:
        no_entities_list.append(f)
        print(f"No Entity candidates for {f}")
```

dakinggg commented 3 weeks ago

Hi, this is not something that exists right now, although it is a reasonable feature request if you want to give implementing it a go! Otherwise, I recommend continuing with what you are doing, i.e. post hoc filtering, and setting the threshold such that you get enough candidates after filtering.
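
The post hoc approach suggested above could be sketched roughly as follows: request more candidates than you need, drop the excluded ones, then keep the best few. This is only an illustration, not part of scispacy. The helper `filter_candidates`, its parameters, and the toy CUIs are all hypothetical. Unlike the snippet in the question, it excludes a candidate if any of its TUIs is on the exclusion list, not just the first.

```python
from typing import AbstractSet, Dict, Iterable, List, Sequence, Tuple

# (CUI, score) pairs, the same shape as the items in entity._.kb_ents
Candidate = Tuple[str, float]


def filter_candidates(
    candidates: Iterable[Candidate],
    cui_to_tuis: Dict[str, Sequence[str]],
    exclude_tuis: AbstractSet[str] = frozenset(),
    exclude_cuis: AbstractSet[str] = frozenset(),
    top_k: int = 4,
) -> List[Candidate]:
    """Drop candidates with an excluded CUI or any excluded TUI, then keep the top_k by score."""
    kept = [
        (cui, score)
        for cui, score in candidates
        if cui not in exclude_cuis
        and not any(tui in exclude_tuis for tui in cui_to_tuis.get(cui, ()))
    ]
    # Highest-scoring candidates first, truncated after filtering
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy data (made-up CUIs) standing in for linker output and the knowledge base
toy_types = {
    "C0000001": ["T079"],
    "C0000002": ["T047"],
    "C0000003": ["T184", "T079"],
    "C0000004": ["T121"],
}
toy_candidates = [
    ("C0000001", 0.98),
    ("C0000002", 0.91),
    ("C0000003", 0.90),
    ("C0000004", 0.88),
]

result = filter_candidates(toy_candidates, toy_types, exclude_tuis={"T079"}, top_k=2)
print(result)  # [('C0000002', 0.91), ('C0000004', 0.88)]
```

With scispacy, `candidates` would be `entity._.kb_ents`, and the TUIs for a CUI could be looked up from `linker.kb.cui_to_entity`, as in the snippet in the question. Raising `max_entities_per_mention` above the number of candidates you actually want leaves room for some to be filtered out.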