Explore data augmentation for NER robustness

Hi, I am working on a COVID-19 antiviral and was spot-checking antivirals in scispacy. I was surprised that remdesivir is not tagged as a chemical in any of the 1,338 PubMed abstracts containing it. I'm using en_ner_bc5cdr_md to extract CHEMICAL and DISEASE entities; spacy: '3.0.4', scispacy: '0.4.0'.

As you see below, remdesivir is not tagged as a CHEMICAL when I run en_ner_bc5cdr_md in Jupyter Lab.

However, when I put the same text into your demo, I was surprised that remdesivir is found.

Questions

1) Is the version running on the demo the same one that I used in my notebook (spacy: '3.0.4', scispacy: '0.4.0')?
2) Maybe remdesivir isn't found because it wasn't present in earlier training sets?
3) Can we expect new chemicals to be recognized (e.g., ones published for the first time)?
4) It's especially surprising that remdesivir wasn't detected as a CHEMICAL even in the following line from the text used in my example, where it's called a 'drug':

Though the drug remdesivir (RDV) is not approved by the FDA, still the "Emergency Use Authorization" (EUA) for compassionate use in severe cases is endorsed.

5) In the demo, remdesivir is detected only once, although it is mentioned several times in that passage. Is that expected?

Thanks, vikram

1) The version on the demo is probably not the latest release version. I should check and update that.

2/3/4) First, this is a model, so some inconsistent and surprising output is to be expected, as is some memorization (@DeNeutoy looks like data augmentation could help a lot here). Second, the BC5CDR corpus was annotated with specific guidelines (https://biocreative.bioinformatics.udel.edu/media/store/files/2015/bc5_CDR_data_guidelines.pdf), which you may want to read to see whether they align with your expectations of what would be annotated as a chemical.

Here is some output for a mix of real and made-up chemical names. I don't really conclude anything from this, other than that the model is definitely using some combination of the form of the name itself and the context:

In [29]: for drug_name in ["mesna", "remdesivir", "mebane", "relidate", "novila", "aspirin", "coloxal", "inovivir", "scopolamine", "entamine", "valimine", "henirin", "noonirin", "halirin"]:
    ...:     text = f"The drug {drug_name} is used to treat the virus"
    ...:     doc = nlp(text)
    ...:     print(doc.ents)
    ...: 
(mesna,)
()
(mebane,)
()
()
(aspirin,)
()
()
(scopolamine,)
(entamine,)
(valimine,)
(henirin,)
()
()

It looks like the model is also sensitive to capitalization:

In [56]: doc = nlp("Remdesivir is a chemical")
In [57]: doc.ents
Out[57]: (Remdesivir,)

In [58]: doc = nlp("remdesivir is a chemical")

In [59]: doc.ents
Out[59]: ()
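
To check the casing effect across more than one name, the probe above could be wrapped in a small helper. This is only a sketch: `casing_probe` and its default template are hypothetical names chosen for illustration, and `nlp` is assumed to be any callable that returns an object with an `.ents` sequence whose items have a `.text` attribute, as a loaded spaCy pipeline does.

```python
def casing_probe(nlp, names, template="The drug {name} is used to treat the virus"):
    """Report, per name and per casing, whether the name is tagged as an entity.

    `nlp` is assumed to behave like a spaCy pipeline: calling it on a string
    returns an object whose `.ents` items expose a `.text` attribute.
    """
    # Casing variants to try for each name (illustrative choice, not exhaustive).
    variants = {
        "lower": str.lower,
        "capitalized": str.capitalize,
        "upper": str.upper,
    }
    report = {}
    for name in names:
        row = {}
        for label, transform in variants.items():
            surface = transform(name)
            doc = nlp(template.format(name=surface))
            # True only if some entity matches the inserted name exactly.
            row[label] = any(ent.text == surface for ent in doc.ents)
        report[name] = row
    return report
```

With the model from this thread, it would be called as something like `casing_probe(spacy.load("en_ner_bc5cdr_md"), ["remdesivir", "aspirin"])`. The helper only shows where casing changes the outcome; it does not explain why.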

I don't have much else to add at the moment. We were thinking about running some data augmentation experiments to try to improve the NER, but haven't done it yet (I'd be thrilled to have a contribution along those lines).

5) Definitely the model takes into account the context that the word occurs in, so it is not wholly surprising to me that the same word could be classified differently in different contexts.
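
As a rough idea of what such augmentation could look like, here is a minimal sketch that produces extra training examples by varying the casing of CHEMICAL mentions and by swapping in other chemical names. It is not an existing scispacy feature: the `(text, spans)` example format with `(start, end, label)` character offsets, the function names `replace_span` and `augment`, and the assumption that spans do not overlap are all illustrative choices.

```python
import random


def replace_span(text, spans, idx, new_surface):
    """Replace the surface form of spans[idx] and shift the offsets after it.

    Assumes `spans` is a list of non-overlapping (start, end, label) tuples.
    """
    start, end, label = spans[idx]
    delta = len(new_surface) - (end - start)
    new_text = text[:start] + new_surface + text[end:]
    new_spans = []
    for i, (s, e, lab) in enumerate(spans):
        if i == idx:
            new_spans.append((start, start + len(new_surface), label))
        elif s >= end:
            # Spans after the edited mention move by the length difference.
            new_spans.append((s + delta, e + delta, lab))
        else:
            new_spans.append((s, e, lab))
    return new_text, new_spans


def augment(text, spans, replacements, rng, target_label="CHEMICAL"):
    """Return augmented (text, spans) copies of one annotated example.

    For every span with `target_label`, emit casing variants that differ
    from the original mention, plus one copy with a randomly chosen
    replacement name.
    """
    out = []
    for i, (s, e, label) in enumerate(spans):
        if label != target_label:
            continue
        mention = text[s:e]
        for variant in (mention.lower(), mention.capitalize()):
            if variant != mention:
                out.append(replace_span(text, spans, i, variant))
        out.append(replace_span(text, spans, i, rng.choice(replacements)))
    return out


example_text = "aspirin and Remdesivir treat fever"
example_spans = [(0, 7, "CHEMICAL"), (12, 22, "CHEMICAL"), (29, 34, "DISEASE")]
augmented = augment(example_text, example_spans, ["mesna", "scopolamine"], random.Random(0))
```

Whether this actually improves robustness would have to be measured; the replacement list and the set of casing variants are the obvious knobs to experiment with.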

allenai / scispacy

Explore data augmentation for NER robustness #336