allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.7k stars 229 forks

How can I ensure that the topic of the text is in the medical domain before extracting entities? #214

Closed kaniska closed 4 years ago

kaniska commented 4 years ago

I am analyzing tweets to check whether a COVID-19-related disease is mentioned. I observed that one tweet talks about the movie Mulan, and the spaCy library treats it as medical terminology.

Input Tweet: RT @ED92Magic: Mulan Inspired merchandise available in China at Disney Store / Disney Town https://t.co/ZniMSngtwC

Output Name: Mulan , Label: CHEMICAL CUI: C1823782, Name: MUL1 gene

Is there any functionality in spaCy or any other library that can be used to first check that the topic of the text is in the medical domain before running entity extraction?

Thanks, Kaniska

dakinggg commented 4 years ago

Hm, I'm not aware of any functionality like that off the top of my head. None of the scispacy or spaCy models (that I am aware of) are trained on tweets, so there may be some challenges from the start. One idea that comes to mind is to compare the average word probability of the tweet under the scispacy model (whose probabilities come from PubMed papers) with that under the spaCy model (whose probabilities come from web text). I don't know how well this will work on short tweets, but here is an example of it working. I picked a random sentence from a medical paper, and you can see that scispacy assigns it a higher probability than the spaCy model does. The values are log probabilities with a minimum of -20. If you follow the same procedure for the tweet above, both scispacy and spaCy assign it the same probability, so you may be able to filter for tweets to which scispacy assigns a higher probability than spaCy does. I'm not sure how accurate this filter will be.


In [20]: import spacy

In [21]: nlp = spacy.load('en_core_web_md')  # assumption: the thread does not name the general web-text model

In [22]: nlp_sci = spacy.load('en_core_sci_md')

In [23]: doc = nlp("Improved blood-glucose control decreases the progression of diabetic microvascular disease, but the effect on macrovascular complications is unknown.")

In [24]: doc_sci = nlp_sci("Improved blood-glucose control decreases the progression of diabetic microvascular disease, but the effect on macrovascular complications is unknown.")

In [25]: probs = [t.prob for t in doc]  # per-token log probabilities

In [26]: probs_sci = [t.prob for t in doc_sci]

In [27]: sum(probs)/len(probs)
Out[27]: -9.265402327884328

In [28]: sum(probs_sci)/len(probs_sci)
Out[28]: -7.871467316150666

In [29]: nlp("diabetic")[0].prob
Out[29]: -13.113289833068848

In [30]: nlp_sci("diabetic")[0].prob
Out[30]: -10.021971702575684

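
The comparison above could be wrapped into a small filter along these lines. This is only a sketch under stated assumptions: the helper names `mean_log_prob` and `looks_medical` and the `margin` value of 0.5 are hypothetical and would need tuning on real tweets, and the pipelines are passed in as arguments so that any two spaCy-style models whose tokens expose `.prob` can be compared.

```python
from typing import Callable, Iterable

# Floor used for empty input; the thread notes log probabilities bottom out at -20.
LOG_PROB_FLOOR = -20.0


def mean_log_prob(tokens: Iterable) -> float:
    """Average the per-token log probabilities (each token must expose `.prob`)."""
    probs = [t.prob for t in tokens]
    return sum(probs) / len(probs) if probs else LOG_PROB_FLOOR


def looks_medical(
    text: str,
    nlp_general: Callable,
    nlp_sci: Callable,
    margin: float = 0.5,  # hypothetical threshold, not taken from the thread
) -> bool:
    """Return True when the scientific model rates the text as more probable
    than the general model by more than `margin` (in average log probability)."""
    general_score = mean_log_prob(nlp_general(text))
    sci_score = mean_log_prob(nlp_sci(text))
    return sci_score - general_score > margin
```

With the two pipelines from the session above, a call would look like `looks_medical(tweet_text, nlp, nlp_sci)`. For the medical sentence shown there, the gap between the two averages is roughly 1.4, which would pass a margin of 0.5; for the Mulan tweet, where both models reportedly assign the same probability, it would not.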
kaniska commented 4 years ago

Thanks so much for the suggestion