allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.68k stars 225 forks

"Mesh" and "Hpo" linkers give the same result #463

Closed almogmor closed 1 year ago

almogmor commented 1 year ago

Hi, I'm trying to annotate data using scispaCy. Loading the "mesh" and "hpo" linkers gives exactly the same results no matter what the input is. For example: [screenshots: image-1, image-2, image-3]

I tried many texts and both linkers produced the same results.

dakinggg commented 1 year ago

Hi, there are two components related to entity recognition and linking in scispaCy. One is the Named Entity Recognition (NER) component, which identifies textual spans that are likely to be entities (and, depending on which scispaCy model you use, also their broad type). This information can be accessed as you've done via doc.ents and doc.ents[0].ent_type_. The second is the Entity Linking component, which is the one you specify mesh/hpo for. That component takes the textual spans selected by the NER component and attempts to link them to an entity from the knowledge base. That information can be accessed via doc.ents[0]._.kb_ents. Hope that helps!
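
To make the distinction concrete, here is a minimal sketch that reads both layers from a processed document. It assumes `doc` came from a scispaCy pipeline that already has the `scispacy_linker` pipe added; the helper name `summarize_entities` is made up for illustration and is not part of the library.

```python
from typing import List, Tuple


def summarize_entities(doc) -> List[Tuple[str, str, list]]:
    """Return (span text, NER label, linker candidates) for each entity.

    Hypothetical helper: `doc` is assumed to be a spaCy Doc produced by a
    pipeline that includes scispaCy's "scispacy_linker" component.
    """
    rows = []
    for ent in doc.ents:
        # ent.text and ent.label_ come from the NER component.
        # ent._.kb_ents comes from the linker: a list of (concept id, score) pairs.
        rows.append((ent.text, ent.label_, list(ent._.kb_ents)))
    return rows
```

Swapping mesh for hpo should only change the third element of each row; the spans and labels come from the NER model and stay the same.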

almogmor commented 1 year ago

Thanks for the quick response, yes it does help. I see now that the entity links are different: [screenshot: diff]

But I couldn't find a way to map back from an id (e.g. 'C0346073') to the name of the entity in the knowledge base ('mesh'/'hpo').

hrshdhgd commented 1 year ago

I have a similar question. In the above example itself, in spite of using hpo as the linker, the id returned is C0346073 instead of HP:0012329 as we'd expect from the mapping shown here. I tried go as well and got the same result. Am I missing something?

dakinggg commented 1 year ago

All of the ontology options are implemented as subsets of UMLS. We don't have any cross mapping to the root ontology identifier. You would have to get that from UMLS or another source. The entity information available from UMLS in scispaCy can be accessed as in this example code:

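# Assumes `nlp` already has the "scispacy_linker" pipe and `doc = nlp(text)`.
linker = nlp.get_pipe("scispacy_linker")
for entity in doc.ents:
    # Each kb_ents item is a (concept id, score) pair.
    for umls_ent in entity._.kb_ents:
        print(linker.kb.cui_to_entity[umls_ent[0]])
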
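
If only the name is needed, the records in `cui_to_entity` expose it directly. A small sketch, assuming scispaCy's entity records carry a `canonical_name` field and that the first item in `kb_ents` is the highest-scoring candidate; `cui_to_name` and `best_link_names` are hypothetical helpers, not part of the library.

```python
def cui_to_name(linker, cui: str) -> str:
    """Look up the canonical name for a concept id in the linker's knowledge base."""
    entity = linker.kb.cui_to_entity[cui]
    # Assumption: the record has a `canonical_name` attribute.
    return entity.canonical_name


def best_link_names(doc, linker):
    """Yield (span text, concept id, canonical name) for each span's first candidate."""
    for ent in doc.ents:
        if ent._.kb_ents:  # spans the linker could not match have an empty list
            cui, _score = ent._.kb_ents[0]
            yield ent.text, cui, cui_to_name(linker, cui)
```

Whichever linker is loaded, the id passed in is a UMLS concept id such as C0346073, so this returns the UMLS name rather than an ontology-native identifier like HP:0012329.
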
hrshdhgd commented 1 year ago

Then how do linkers like hpo and go change the output?

dakinggg commented 1 year ago

They link to subsets of UMLS that are more specific than the full UMLS. If you know that you only want entities from one of those subsets, this helps in at least two ways: 1) the downloaded file is much smaller and memory usage is lower, and 2) the results will have higher precision, because you won't get links to entities of a type you are not interested in.
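
In practice the subset is chosen when the linker is added to the pipeline. A sketch, assuming the `linker_name` config key and the names umls, mesh, rxnorm, go, and hpo listed in the scispaCy README; `KNOWN_LINKERS`, `linker_config`, and `add_linker` are illustrative names, not library API.

```python
# Assumption: these are the linker names scispaCy accepts (taken from its README).
KNOWN_LINKERS = {"umls", "mesh", "rxnorm", "go", "hpo"}


def linker_config(linker_name: str) -> dict:
    """Build the config dict for the "scispacy_linker" pipe."""
    if linker_name not in KNOWN_LINKERS:
        raise ValueError(f"unknown linker name: {linker_name!r}")
    return {"linker_name": linker_name}


def add_linker(nlp, linker_name: str):
    """Add the entity linker for the chosen UMLS subset to an existing pipeline."""
    # Importing scispacy.linking registers the "scispacy_linker" factory with spaCy.
    import scispacy.linking  # noqa: F401

    return nlp.add_pipe("scispacy_linker", config=linker_config(linker_name))


# Example (downloads the knowledge base files on first use):
#   nlp = spacy.load("en_core_sci_sm")
#   add_linker(nlp, "hpo")
```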

almogmor commented 1 year ago

Is there any way to map the 'mesh' or 'hpo' linker results back to the relevant UMLS entities? In other words, if I'm using the umls linker, can I filter which entities are 'mesh' related and which are 'hpo' related?

e.g. Screenshot 2023-01-01 212041

dakinggg commented 1 year ago

The mesh and hpo linker entities should contain the exact same information as the umls linker entities since they are just a subset.
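
Since the subsets share UMLS concept ids, one way to tell which UMLS links are also MeSH or HPO concepts is a membership test against the smaller knowledge bases. This is a sketch of that idea, not something scispaCy provides: it assumes each subset is available as a `cui_to_entity`-style mapping (for example from a second linker loaded with `linker_name` set to hpo), and `tag_sources` is a made-up helper.

```python
from typing import Iterable, List, Mapping, Tuple


def tag_sources(
    candidates: Iterable[Tuple[str, float]],
    subsets: Mapping[str, Mapping[str, object]],
) -> List[Tuple[str, float, List[str]]]:
    """Annotate (concept id, score) pairs with the subsets that contain the id.

    `subsets` maps a label such as "mesh" or "hpo" to a dict keyed by concept id,
    e.g. the `kb.cui_to_entity` dict of a linker loaded for that subset.
    """
    tagged = []
    for cui, score in candidates:
        # A concept belongs to a subset if its id is a key in that subset's mapping.
        sources = [name for name, kb in subsets.items() if cui in kb]
        tagged.append((cui, score, sources))
    return tagged
```

Called with `entity._.kb_ents` from the umls linker, this keeps every candidate and lists the subsets it appears in. The cost is that each subset's knowledge base has to be loaded alongside the full UMLS one.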