explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Italian & Spanish NER shouldn't extract "Google" or "Facebook"? #13551

Open davgargar opened 5 months ago

davgargar commented 5 months ago

While extracting entities from news articles, I noticed this behavior:

image

These words are present in articles but are not extracted by the models.

Does anyone know the reason?

Info about spaCy

Siddharth-Latthe-07 commented 3 months ago

The issue might be due to the following reasons:

  1. Model Training Data: spaCy's pre-trained models are trained on specific datasets. If certain entities or terms were not sufficiently represented in the training data, the model might not recognize them as entities.

  2. Model Limitations: Every model has its limitations. The pre-trained models may not always capture all entities accurately.

  3. Language Model: The performance of entity recognition can vary between different language models. For example, the es_core_news_lg and it_core_news_lg models are specifically trained for Spanish and Italian, respectively. If the entities you are trying to extract are domain-specific or less common, these models might not perform well.
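To check which of the names you expect are actually missed by a model, a small helper can compare the extracted entity texts against a list of terms. This is only a minimal sketch: the `missing_terms` function and the sample data are hypothetical, and in practice `extracted` would come from something like `[ent.text for ent in doc.ents]`.

```python
def missing_terms(extracted, expected):
    """Return the expected terms that are absent from the extracted entity texts.

    The comparison is case-insensitive and ignores surrounding whitespace.
    """
    found = {text.strip().lower() for text in extracted}
    return [term for term in expected if term.strip().lower() not in found]


# Hypothetical sample data, standing in for [ent.text for ent in doc.ents]
extracted = ["Madrid", "Pedro Sánchez"]
expected = ["Google", "Facebook", "Madrid"]

missing = missing_terms(extracted, expected)
print(missing)
```

Running this over a batch of articles for each model would show whether the gap is specific to a few names or more general.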

To solve the issue, you may try these steps and let me know if they work:

  1. Custom Training: Train a custom Named Entity Recognition (NER) model on your own dataset.
  2. Data Augmentation: If you have a small dataset, consider augmenting it with more examples or using transfer learning.
  3. Entity Ruler: Use spaCy's EntityRuler to add rule-based entity extraction.
  4. Model Evaluation and Fine-Tuning: Evaluate the performance of different spaCy models and fine-tune them to better suit your needs.

Sample code:

    import spacy

    # Load the spaCy model
    nlp = spacy.load("es_core_news_lg")  # or "it_core_news_lg"

    # Add an EntityRuler before the statistical NER component.
    # In spaCy v3, add_pipe takes the component's string name, not an object.
    ruler = nlp.add_pipe("entity_ruler", before="ner", config={"overwrite_ents": True})

    # Add patterns
    patterns = [
        {"label": "ORG", "pattern": "OpenAI"},
        {"label": "PRODUCT", "pattern": "ChatGPT"},
        # Add more patterns as needed
    ]
    ruler.add_patterns(patterns)

    # Process a text
    doc = nlp("OpenAI has developed ChatGPT.")

    # Print the entities
    for ent in doc.ents:
        print(ent.text, ent.label_)
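For the names from the original question, the same rule-based idea can be sketched without downloading a trained model. This assumes spaCy v3 and uses a blank Spanish pipeline purely for illustration. The example sentence and the choice of the `ORG` label are my own assumptions, and the `LOWER` token patterns make the match case-insensitive.

```python
import spacy

# Blank Spanish pipeline: tokenizer only, no statistical NER (illustration only)
nlp = spacy.blank("es")

# Register the rule-based component by its string name (spaCy v3 API)
ruler = nlp.add_pipe("entity_ruler")

# Token patterns matching on the lowercase form, so "GOOGLE" also matches
ruler.add_patterns([
    {"label": "ORG", "pattern": [{"LOWER": "google"}]},
    {"label": "ORG", "pattern": [{"LOWER": "facebook"}]},
])

# Hypothetical example sentence
doc = nlp("Google y Facebook dominan la publicidad digital.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With a trained pipeline such as es_core_news_lg, the ruler would instead be added with `before="ner"` so that the statistical component keeps the rule-based entities.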


Hope this helps. Thanks!