allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Combining Entities Recognized by Different Models & by the AbbreviationDetector #388

Open Salamander230 opened 3 years ago

Salamander230 commented 3 years ago

I recently encountered both spaCy and ScispaCy, and so far I think ScispaCy is an awesome tool for identifying biomedical entities in text and linking them to concepts from UMLS and other knowledge bases.

I was thinking it would be even more powerful if the entities identified by different models and by the AbbreviationDetector can be combined. This would allow the shortcomings of one model to be compensated by another model. It would also allow a model's shortcomings to be compensated by the long forms of any detected abbreviations.

For example, the identified entities in "Spinal and bulbar muscular atrophy (SBMA)" using the en_core_sci_lg model in the ScispaCy Demo are:

However, after adding the AbbreviationDetector as a pipe, we would recognize "SBMA" as an abbreviation for "Spinal and bulbar muscular atrophy", so really, the entities should be the following, but they are not corrected as such:

Similarly, some models may identify fragments of a phrase as separate entities while another model may recognize a whole phrase as one entity. Or, some models may recognize certain entities while other models may completely ignore them. If there is some way of consolidating entities found by different models, then a more accurate and complete list of entities will be obtained than just using any given model individually.
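One possible reading of the consolidation step is sketched below in plain Python: pool the candidate spans from every model (and the long forms reported by the AbbreviationDetector), then keep the longest spans and drop anything that overlaps a span already kept. The `Candidate` type, the `consolidate` function, the source labels, and the token offsets are all hypothetical and chosen only for illustration; they are not part of ScispaCy. spaCy's own `spacy.util.filter_spans` applies a similar longest-span-first rule to `Span` objects.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    """A proposed entity span (hypothetical type, not a ScispaCy class)."""
    start: int   # token offset, inclusive
    end: int     # token offset, exclusive
    source: str  # which model or component proposed the span


def consolidate(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep the longest spans and drop any span overlapping a kept one."""
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(candidates, key=lambda c: (-(c.end - c.start), c.start))
    kept: List[Candidate] = []
    taken = set()  # token indices already covered by a kept span
    for cand in ordered:
        tokens = range(cand.start, cand.end)
        if any(t in taken for t in tokens):
            continue
        kept.append(cand)
        taken.update(tokens)
    # Return the survivors in document order.
    return sorted(kept, key=lambda c: c.start)


# Illustrative token offsets for
# "Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease"
candidates = [
    Candidate(0, 1, "model_a"),                 # "Spinal"
    Candidate(2, 5, "model_a"),                 # "bulbar muscular atrophy"
    Candidate(6, 7, "model_a"),                 # "SBMA"
    Candidate(0, 5, "abbreviation_long_form"),  # "Spinal and bulbar muscular atrophy"
    Candidate(10, 14, "model_b"),               # "inherited motor neuron disease"
    Candidate(10, 11, "model_a"),               # "inherited"
    Candidate(11, 14, "model_a"),               # "motor neuron disease"
]
merged = consolidate(candidates)
```

With these inputs the long form supplied by the abbreviation source replaces the two fragments from `model_a`, and the non-overlapping "SBMA" span survives. A real implementation would also need to decide how to reconcile conflicting entity labels, which this sketch ignores.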

There are also times when a longer entity is not better, because it may yield poor matching results that fall below the desired mention threshold for a given knowledge base. For example, in the ScispaCy Demo, the en_core_sci_md model identifies "inherited motor neuron disease" as an entity but gives no results satisfying the mention threshold of 0.85. On the other hand, the en_core_sci_sm model identifies "inherited" and "motor neuron disease" as separate entities, each of which has matches above the 0.85 mention threshold. Therefore, it may generally be helpful to also keep track of the original, unconsolidated entities from each model and fall back to the next-longest entities that have matching results above the desired mention threshold.
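The fallback described above could be sketched as follows, assuming the consolidation step remembers which shorter spans each merged span replaced and that a best linker score is available per span. The `pick_linkable` function, the `children` mapping, and the scores are hypothetical; the scores in particular are made up for illustration and are not actual linker output.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def pick_linkable(
    span: Span,
    children: Dict[Span, List[Span]],
    best_score: Dict[Span, float],
    threshold: float,
) -> List[Span]:
    """Return the span if its best match meets the threshold;
    otherwise fall back to the shorter spans it replaced."""
    if best_score.get(span, 0.0) >= threshold:
        return [span]
    result: List[Span] = []
    for child in children.get(span, []):
        result.extend(pick_linkable(child, children, best_score, threshold))
    return result


# "inherited motor neuron disease" (10, 14) replaced two shorter spans.
children = {(10, 14): [(10, 11), (11, 14)]}
# Made-up scores: the long span has no match above the threshold.
best_score = {(10, 14): 0.0, (10, 11): 0.9, (11, 14): 0.95}

chosen = pick_linkable((10, 14), children, best_score, threshold=0.85)
```

Here the long span is rejected and the two shorter spans are returned in its place. A span with no acceptable match and no recorded children yields an empty list, so the caller would still have to decide whether to keep it as an unlinked entity.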

Overall, a function with the following components would be roughly what I'm looking for:

Here is how a call to the proposed function, which I call consolidated_entities_tuple, might look (this is NOT functioning code, just an example of how I imagine the functionality):

import spacy
import scispacy

from scispacy.linking import EntityLinker
from scispacy.abbreviation import AbbreviationDetector

def consolidated_entities_tuple(text: str, long_form_abbrev_ents: bool, model_list: list, scispacy_linker_config: dict):
     # place code for function here, likely to utilize the imported modules above
     return (nlp, doc)

text = "Spinal and bulbar muscular atrophy (SBMA) is an \
inherited motor neuron disease caused by the expansion \
of a polyglutamine tract within the androgen receptor (AR). \
SBMA can be caused by this easily."

tup = consolidated_entities_tuple(text, True, ["en_core_sci_sm", "en_core_sci_scibert", "en_ner_bc5cdr_md"], 
                                  {"resolve_abbreviations": True, "filter_for_definitions": False, 
                                   "no_definition_threshold": 0.85, "linker_name": "umls"})

nlp = tup[0]
doc = tup[1]

# Let's look at the first entity
entity = doc.ents[0]

print("Name: ", entity)
>>> Name: Spinal and bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
    print(linker.kb.cui_to_entity[umls_ent[0]])

>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
                gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0752353, Name: Atrophy, Muscular, Spinobulbar
>>> Definition: .....
>>> TUI(s): T047
>>> Aliases: (total: ?):
         ... , ... , ... , ...

>>> .....

# Now let's look at the abbreviations in the text
print("Abbreviation", "\t", "Span", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation     Span       Definition
>>> SBMA         (33, 34)   Spinal and bulbar muscular atrophy
>>> SBMA         (6, 7)     Spinal and bulbar muscular atrophy
>>> AR           (29, 30)   androgen receptor

Thank you for taking the time to read this. If this sort of function already exists in ScispaCy, please let me know. Otherwise, if this sort of function or some other code that accomplishes the same thing can be added to ScispaCy, that would be awesome. I believe it can be a powerful addition to the library. Let me know your thoughts.

dakinggg commented 3 years ago

Hi, I think there are others that would like to have this function as well, but I will likely not have time to work on it in the near future. I would welcome a contribution with this function though, if you would be interested in creating a PR and some tests for it!

ulc0 commented 3 months ago

We have both a requirement and capacity to work on this function, but may need some guidance on the spec.

-Kate B., CDH (Databricks)

dakinggg commented 3 months ago

Hi @ulc0 I think the original issue is a reasonable description! Are there any particular areas you are looking for guidance on? If you'd like to propose a design, I'd be happy to take a look here.