Open Salamander230 opened 3 years ago
Hi, I think there are others that would like to have this function as well, but I will likely not have time to work on it in the near future. I would welcome a contribution with this function though, if you would be interested in creating a PR and some tests for it!
We have both a requirement and capacity to work on this function, but may need some guidance on the spec.
-Kate B., CDH (Databricks)
Hi @ulc0 I think the original issue is a reasonable description! Are there any particular areas you are looking for guidance on? If you'd like to propose a design, I'd be happy to take a look here.
I recently encountered both spaCy and ScispaCy and so far I think ScispaCy is an awesome tool to be able to identify and link biomedical entities found in text with concepts from UMLS and other knowledge bases.
I was thinking it would be even more powerful if the entities identified by different models and by the AbbreviationDetector can be combined. This would allow the shortcomings of one model to be compensated by another model. It would also allow a model's shortcomings to be compensated by the long forms of any detected abbreviations.
For example, the identified entities in "Spinal and bulbar muscular atrophy (SBMA)" using the en_core_sci_lg model in the ScispaCy Demo are:
However, after adding the AbbreviationDetector as a pipe, we would recognize "SBMA" as an abbreviation for "Spinal and bulbar muscular atrophy", so really, the entities should be the following, but they are not corrected as such:
Similarly, some models may identify fragments of a phrase as separate entities while another model may recognize a whole phrase as one entity. Or, some models may recognize certain entities while other models may completely ignore them. If there is some way of consolidating entities found by different models, then a more accurate and complete list of entities will be obtained than just using any given model individually.
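To make the consolidation idea concrete, here is a minimal sketch in plain Python. It assumes each model's entities have already been reduced to character offsets, and it uses one possible rule: prefer the longest mention and drop anything that overlaps it. The Mention type, the consolidate function, and the two example model outputs are hypothetical and made up for illustration; they are not ScispaCy API and not real model output. spaCy's spacy.util.filter_spans applies a similar longest-first rule to Span objects within a single Doc.

```python
from typing import Iterable, List, NamedTuple


class Mention(NamedTuple):
    """Hypothetical container for one detected entity."""
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    source: str  # which model or component produced it


def consolidate(mention_lists: Iterable[Iterable[Mention]]) -> List[Mention]:
    """Merge mentions from several sources, keeping the longest non-overlapping ones."""
    candidates = [m for mentions in mention_lists for m in mentions]
    # Longest first; ties are broken by the earlier start offset.
    candidates.sort(key=lambda m: (-(m.end - m.start), m.start))
    kept: List[Mention] = []
    for m in candidates:
        # Keep a mention only if it does not overlap anything already kept.
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


text = "Spinal and bulbar muscular atrophy (SBMA)"
# Made-up outputs: one source fragments the phrase, the other finds it whole.
model_a = [Mention(0, 6, "model_a"), Mention(11, 34, "model_a")]
model_b = [Mention(0, 34, "model_b"), Mention(36, 40, "model_b")]

merged = consolidate([model_a, model_b])
print([text[m.start:m.end] for m in merged])
# ['Spinal and bulbar muscular atrophy', 'SBMA']
```

Longest-first is only one choice of rule. It keeps the source of each surviving mention, so the original fragments can still be looked up if the long form turns out to be a poor match.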
There are also times when a longer entity is not better, because it may yield poor matches that fall below the desired mention threshold for a given knowledge base. For example, in the ScispaCy Demo, the en_core_sci_md model identifies "inherited motor neuron disease" as an entity but gives no results satisfying the mention threshold of 0.85. On the other hand, the en_core_sci_sm model identifies "inherited" and "motor neuron disease" as separate entities, each of which has matches above the 0.85 mention threshold. Therefore, it may generally be helpful to also keep track of any related original, unconsolidated entities from each model and pick the next-longest entities that have matching results above the desired mention threshold.
Overall, a function with the following components would be roughly what I'm looking for:
Parameters to take in:
Output: A tuple with the following two items:
Here is how use of the proposed function, which I call consolidated_entities_tuple, might look (this is NOT functioning code, just an example of how I imagine the functionality to be):
Thank you for taking the time to read this. If this sort of function already exists in ScispaCy, please let me know. Otherwise, if this sort of function or some other code that accomplishes the same thing can be added to ScispaCy, that would be awesome. I believe it can be a powerful addition to the library. Let me know your thoughts.
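As a rough illustration of the threshold fallback described above (use the longest entity only if it links, otherwise fall back to its shorter fragments), here is a small self-contained sketch. The pick_linkable function and the best_score mapping are hypothetical, and the scores are invented numbers chosen to mirror the "inherited motor neuron disease" example; in practice they would come from whatever entity linker is in use.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def pick_linkable(
    spans: List[Span],
    best_score: Dict[Span, float],
    threshold: float = 0.85,
) -> List[Span]:
    """Keep the longest non-overlapping spans whose best match clears the threshold."""
    # Discard spans with no knowledge base match at or above the threshold.
    linkable = [s for s in spans if best_score.get(s, 0.0) >= threshold]
    # Longest first, so shorter fragments are used only where the long form failed.
    linkable.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for s in linkable:
        if all(s[1] <= k[0] or s[0] >= k[1] for k in kept):
            kept.append(s)
    return sorted(kept)


text = "inherited motor neuron disease"
spans = [(0, 30), (0, 9), (10, 30)]  # whole phrase plus its two fragments
# Invented scores: the whole phrase falls below 0.85, the fragments do not.
best_score = {(0, 30): 0.70, (0, 9): 0.90, (10, 30): 0.95}

chosen = pick_linkable(spans, best_score)
print([text[a:b] for a, b in chosen])
# ['inherited', 'motor neuron disease']
```

With a lower threshold such as 0.6, the same call would keep the whole phrase and drop the fragments.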