Combining Entities Recognized by Different Models & by the AbbreviationDetector

Salamander230 commented 3 years ago

I recently encountered both spaCy and ScispaCy, and so far I think ScispaCy is an awesome tool for identifying biomedical entities in text and linking them to concepts in UMLS and other knowledge bases.

I think it would be even more powerful if the entities identified by different models and by the AbbreviationDetector could be combined. One model's shortcomings could then be compensated for by another model, or by the long forms of any detected abbreviations.

For example, the entities identified in "Spinal and bulbar muscular atrophy (SBMA)" by the en_core_sci_lg model in the ScispaCy Demo are:

"Spinal"
"bulbar muscular atrophy"
"SBMA"

However, once the AbbreviationDetector is added as a pipe, "SBMA" is recognized as an abbreviation for "Spinal and bulbar muscular atrophy". The entities should therefore be the following, but the model's output is not corrected accordingly:

"Spinal and bulbar muscular atrophy"
"SBMA"

Similarly, some models may identify fragments of a phrase as separate entities while another model recognizes the whole phrase as one entity. Some models may also recognize entities that other models miss entirely. If there were a way to consolidate the entities found by different models, the resulting list would be more accurate and complete than the list from any single model.
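
To make the consolidation idea concrete, here is a minimal sketch in plain Python that works on character offsets rather than on spaCy objects. It is only an illustration, not existing ScispaCy functionality: the `Span` tuple, the `consolidate` function, and the `source` tag are hypothetical names, and the rule "longest span wins, ties go to the earlier span" is an assumption about how conflicts might be resolved.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    source: str  # hypothetical tag: which model or detector proposed the span


def consolidate(spans: List[Span]) -> List[Span]:
    """Keep the longest non-overlapping spans; ties go to the earlier span."""
    # Visit candidates from longest to shortest so that a whole phrase
    # is considered before any of its fragments.
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), s.start))
    kept: List[Span] = []
    for cand in ordered:
        # Accept a candidate only if it overlaps nothing already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)


text = "Spinal and bulbar muscular atrophy (SBMA)"
proposed = [
    Span(0, 6, "en_core_sci_lg"),           # "Spinal"
    Span(11, 34, "en_core_sci_lg"),         # "bulbar muscular atrophy"
    Span(36, 40, "en_core_sci_lg"),         # "SBMA"
    Span(0, 34, "abbreviation_long_form"),  # "Spinal and bulbar muscular atrophy"
]
merged = consolidate(proposed)
merged_texts = [text[s.start:s.end] for s in merged]
print(merged_texts)
```

Within spaCy itself, `spacy.util.filter_spans` applies a similar longest-first rule to a list of `Span` objects, so a real implementation could probably collect spans from each model and from the abbreviation long forms and pass them through that helper.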

A longer entity is not always better, however, because it may yield no matches above the desired mention threshold for a given knowledge base. For example, in the ScispaCy Demo, the en_core_sci_md model identifies "inherited motor neuron disease" as an entity but returns no results satisfying the mention threshold of 0.85. The en_core_sci_sm model, on the other hand, identifies "inherited" and "motor neuron disease" as separate entities, each of which has matches above the 0.85 threshold. It would therefore be helpful to keep track of the original, unconsolidated entities from each model and fall back to the next-longest entities that have matches above the desired mention threshold.
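
The fallback described above could be sketched roughly as follows, again in plain Python on character offsets. Everything here is hypothetical: `select_linkable` is an illustrative name, and the scores are made-up placeholders standing in for each span's best linker score, not real output from any ScispaCy model or linker.

```python
from typing import Dict, List, Tuple

Offsets = Tuple[int, int]  # (start, end) character offsets, end exclusive


def select_linkable(candidates: Dict[Offsets, float], threshold: float) -> List[Offsets]:
    """Pick the longest non-overlapping spans whose best score meets the threshold."""
    # Discard spans that cannot be linked confidently, then prefer longer spans.
    passing = [span for span, score in candidates.items() if score >= threshold]
    passing.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Offsets] = []
    for start, end in passing:
        # A shorter span survives only if no longer passing span covers it.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


text = "inherited motor neuron disease"
# Placeholder scores for illustration only (not real linker output).
scores = {
    (0, 30): 0.80,   # "inherited motor neuron disease"
    (0, 9): 0.90,    # "inherited"
    (10, 30): 0.95,  # "motor neuron disease"
}
strict = select_linkable(scores, threshold=0.85)
lenient = select_linkable(scores, threshold=0.75)
print([text[s:e] for s, e in strict])
print([text[s:e] for s, e in lenient])
```

With the stricter threshold the whole phrase is dropped and the two shorter entities are kept; with the more lenient one the whole phrase wins because it is the longest span that still passes.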

Overall, a function with the following components would be roughly what I'm looking for:

Parameters to take in:
- The text string from which entities will be identified.
- A boolean for whether or not to identify the long forms of abbreviations as entities. (e.g., True)
- A list of the desired models to use (e.g., ["en_core_sci_sm", "en_core_sci_scibert", "en_ner_bc5cdr_md"]).
- A dictionary with any desired configurations of the scispacy linker, including the linker name (e.g., {"resolve_abbreviations": True, "filter_for_definitions": False, "no_definition_threshold": 0.85, "linker_name": "umls"})

Output: a tuple with the following two items:
- The nlp object, from which the linker for the chosen knowledge base can be retrieved.
- A Doc object containing the longest entities that also have matches above the user's desired mention threshold.

Here is what use of the proposed function, which I call consolidated_entities_tuple, might look like (this is NOT functioning code, just an example of how I imagine the functionality):

import spacy
import scispacy

from scispacy.linking import EntityLinker
from scispacy.abbreviation import AbbreviationDetector

def consolidated_entities_tuple(text: str, long_form_abbrev_ents: bool, model_list: list, scispacy_linker_config: dict):
    # Placeholder for the function body, which would likely use the modules imported above.
    return (nlp, doc)

text = "Spinal and bulbar muscular atrophy (SBMA) is an \
inherited motor neuron disease caused by the expansion \
of a polyglutamine tract within the androgen receptor (AR). \
SBMA can be caused by this easily."

tup = consolidated_entities_tuple(text, True, ["en_core_sci_sm", "en_core_sci_scibert", "en_ner_bc5cdr_md"], 
                                  {"resolve_abbreviations": True, "filter_for_definitions": False, 
                                   "no_definition_threshold": 0.85, "linker_name": "umls"})

nlp = tup[0]
doc = tup[1]

# Let's look at the first entity
entity = doc.ents[0]

print("Name: ", entity)
>>> Name: Spinal and bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
    print(linker.kb.cui_to_entity[umls_ent[0]])

>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
                gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0752353, Name: Atrophy, Muscular, Spinobulbar
>>> Definition: .....
>>> TUI(s): T047
>>> Aliases: (total: ?):
         ... , ... , ... , ...

>>> .....

# Now let's look at the abbreviations in the text
print("Abbreviation", "\t", "Span", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation     Span       Definition
>>> SBMA         (33, 34)   Spinal and bulbar muscular atrophy
>>> SBMA         (6, 7)     Spinal and bulbar muscular atrophy
>>> AR           (29, 30)   androgen receptor

Thank you for taking the time to read this. If this sort of function already exists in ScispaCy, please let me know. Otherwise, if this sort of function or some other code that accomplishes the same thing can be added to ScispaCy, that would be awesome. I believe it can be a powerful addition to the library. Let me know your thoughts.

dakinggg commented 3 years ago

Hi, I think there are others that would like to have this function as well, but I will likely not have time to work on it in the near future. I would welcome a contribution with this function though, if you would be interested in creating a PR and some tests for it!

ulc0 commented 3 months ago

> Hi, I think there are others that would like to have this function as well, but I will likely not have time to work on it in the near future. I would welcome a contribution with this function though, if you would be interested in creating a PR and some tests for it!

We have both a requirement and capacity to work on this function, but may need some guidance on the spec.

-Kate B., CDH (Databricks)

dakinggg commented 3 months ago

Hi @ulc0 I think the original issue is a reasonable description! Are there any particular areas you are looking for guidance on? If you'd like to propose a design, I'd be happy to take a look here.

allenai / scispacy

Combining Entities Recognized by Different Models & by the AbbreviationDetector #388