allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Explore data augmentation for NER robustness #336

Open vskmd opened 3 years ago

vskmd commented 3 years ago

Hi, I am working on a COVID-19 antiviral and was spot-checking antivirals in scispacy. I was surprised that remdesivir is not tagged as a chemical in any of the 1,338 PubMed abstracts containing it. I'm using en_ner_bc5cdr_md to extract CHEMICAL and DISEASE entities (spacy: '3.0.4', scispacy: '0.4.0').

As you see below, remdesivir is not tagged as a CHEMICAL when I run en_ner_bc5cdr_md in Jupyter Lab.

image

However, when I put the same text into your demo, I was surprised that remdesivir is found.

image

Questions

Though the drug remdesivir (RDV) is not approved by the FDA, still the "Emergency Use Authorization" (EUA) for compassionate use in severe cases is endorsed.

Thanks, vikram

dakinggg commented 3 years ago

1) The version on the demo is probably not the latest release version. I should check and update that.

2/3/4) First, this is a model, so inconsistent and surprising output is likely, and some memorization is likely (@DeNeutoy looks like data augmentation could help a lot here). Second, the BC5CDR corpus was annotated with specific guidelines (https://biocreative.bioinformatics.udel.edu/media/store/files/2015/bc5_CDR_data_guidelines.pdf), which you may want to read to see whether they align with your expectations of what would be annotated as a chemical.

Here is some output for a mix of real and made-up chemical names. I don't really conclude anything from this, other than that the model is definitely using some combination of the form of the name itself and the context:

In [29]: for drug_name in ["mesna", "remdesivir", "mebane", "relidate", "novila", "aspirin", "coloxal", "inovivir", "scopolamine", "entamine", "valimine", "henirin", "noonirin", "halirin"]:
    ...:     text = f"The drug {drug_name} is used to treat the virus"
    ...:     doc = nlp(text)
    ...:     print(doc.ents)
    ...: 
(mesna,)
()
(mebane,)
()
()
(aspirin,)
()
()
(scopolamine,)
(entamine,)
(valimine,)
(henirin,)
()
()

It looks like the model is also sensitive to capitalization:

In [56]: doc = nlp("Remdesivir is a chemical")
In [57]: doc.ents
Out[57]: (Remdesivir,)

In [58]: doc = nlp("remdesivir is a chemical")

In [59]: doc.ents
Out[59]: ()
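
To check this more systematically than one name at a time, something like the following could be used. It is only a rough sketch: `case_variants` and `probe_case_sensitivity` are hypothetical helper names, not part of spaCy or scispacy, and the only assumption about `nlp` is that calling it on a string returns an object whose `.ents` items have a `.text` attribute (as a spaCy pipeline does).

```python
from typing import Callable, Dict, List


def case_variants(name: str) -> List[str]:
    """Return the lowercase, capitalized, and uppercase forms, without duplicates."""
    seen: List[str] = []
    for variant in (name.lower(), name.capitalize(), name.upper()):
        if variant not in seen:
            seen.append(variant)
    return seen


def probe_case_sensitivity(
    nlp: Callable,
    names: List[str],
    template: str = "{name} is a chemical",
) -> Dict[str, Dict[str, bool]]:
    """For each name, record whether each case variant is returned as an entity.

    `nlp` is assumed to behave like a spaCy pipeline: nlp(text).ents is an
    iterable of spans, each exposing `.text`.
    """
    results: Dict[str, Dict[str, bool]] = {}
    for name in names:
        results[name] = {}
        for variant in case_variants(name):
            doc = nlp(template.format(name=variant))
            # True if some predicted entity is exactly this variant.
            results[name][variant] = any(ent.text == variant for ent in doc.ents)
    return results
```

With scispacy and the model installed, this would presumably be called as `probe_case_sensitivity(spacy.load("en_ner_bc5cdr_md"), ["remdesivir", "aspirin"])`. The template sentence is the one from the session above; other templates would be needed to separate the effect of casing from the effect of context.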

I don't have much else to add at the moment. We were thinking about running some data augmentation experiments to try to improve the NER, but haven't done it yet (I'd be thrilled to have a contribution along those lines).

5) Definitely the model takes into account the context that the word occurs in, so it is not wholly surprising to me that the same word could be classified differently in different contexts.
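
For anyone who wants to try the data augmentation idea, here is one possible starting point. To be clear, this is not something we have run or evaluated; it is a minimal sketch that assumes training examples are stored as a text plus a list of non-overlapping `(start_char, end_char, label)` tuples. The names `transform_mentions` and `augment_example` are made up for illustration.

```python
from typing import Callable, List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def transform_mentions(
    text: str, entities: List[Entity], fn: Callable[[str], str]
) -> Tuple[str, List[Entity]]:
    """Apply fn to every entity mention and recompute the character offsets.

    Assumes the entities do not overlap. Offsets are rebuilt rather than
    reused, so fn may change the length of a mention (e.g. swap in another name).
    """
    pieces: List[str] = []
    new_entities: List[Entity] = []
    cursor = 0   # position in the original text
    new_len = 0  # length of the text built so far
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])
        new_len += start - cursor
        mention = fn(text[start:end])
        pieces.append(mention)
        new_entities.append((new_len, new_len + len(mention), label))
        new_len += len(mention)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_entities


def augment_example(
    text: str, entities: List[Entity]
) -> List[Tuple[str, List[Entity]]]:
    """Return the original example plus lowercased and capitalized mention variants."""
    variants = [(text, sorted(entities))]
    for fn in (str.lower, str.capitalize):
        candidate = transform_mentions(text, entities, fn)
        if candidate not in variants:  # skip variants identical to an existing one
            variants.append(candidate)
    return variants
```

For example, `augment_example("Remdesivir is used to treat the virus", [(0, 10, "CHEMICAL")])` yields the original sentence plus a copy with the mention lowercased. The same `transform_mentions` helper could also substitute other chemical names into a mention slot, which might reduce memorization of specific names, though whether either augmentation actually helps would need to be measured.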