explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

"Burkina Faso" not detected #8133

Closed joaobarcia closed 3 years ago

joaobarcia commented 3 years ago

spaCy does not detect Burkina Faso.

I work for an NGO that does data analysis on violent incidents affecting the humanitarian and education sectors. We are using spaCy to automate our data entry pipeline. It works amazingly well most of the time, but several of our events are in Burkina Faso, and spaCy does not seem to detect this country specifically. I think all other countries are detected.

How to reproduce the behaviour

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Burkina-Faso, 13 February 2020: In Moussakuy village, Sanaba department, Boucle du Mouhoun region, an unnamed school was attacked by a group of ten heavily armed men who burned electoral material and seized food from the canteen")
print(doc.ents)

Returns

(13 February 2020, Moussakuy, Sanaba, Boucle, Mouhoun, ten)

polm commented 3 years ago

This is deeply unfortunate but not a bug (error in code) or something we can fix directly.

The models are statistical and make mistakes. It would be unsurprising if Burkina Faso was rarely mentioned in the training data, and the name is made up of two separate, uncommon words, which is unusual for a country name, so it's hard for the model to recognize. You can read more about errors in the models in #3052.

The good news is it's detected in the medium and large models.

import spacy
nlp = spacy.load("en_core_web_md")
text = "I went to Burkina Faso for vacation."
for ent in nlp(text).ents:
    print(ent.label_, ent, sep="\t")
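
If switching to a larger model is not an option, one possible workaround is to add a rule-based pattern for the name. This is not something proposed in the thread itself, just a minimal sketch assuming spaCy v3 and its built-in `entity_ruler` component. The `GPE` label, the token pattern, and the helper `find_entities` are illustrative choices. A blank English pipeline is used so the example runs without downloading a model.

```python
import spacy

# Assumption: spaCy v3 API. A blank pipeline keeps this sketch self-contained.
# With a trained pipeline such as en_core_web_sm, the ruler would typically be
# added with nlp.add_pipe("entity_ruler", before="ner") so that the statistical
# NER works around the rule-based matches.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Token pattern that matches "Burkina Faso" and the hyphenated "Burkina-Faso"
# from the original report (the English tokenizer splits on the hyphen).
ruler.add_patterns([
    {
        "label": "GPE",
        "pattern": [
            {"LOWER": "burkina"},
            {"ORTH": "-", "OP": "?"},
            {"LOWER": "faso"},
        ],
    }
])


def find_entities(text):
    """Hypothetical helper: return (text, label) pairs for all entities."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


print(find_entities("I went to Burkina Faso for vacation."))
print(find_entities("Burkina-Faso, 13 February 2020: an unnamed school was attacked."))
```

A rule like this only covers the spellings listed in the pattern, so it complements the statistical model rather than replacing it.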

Because this is not a bug, I'm going to move it to a discussion. Please feel free to follow up there if you want.