explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

[W030] Some entities could not be aligned in the text #13533

Closed NitGS closed 4 months ago

NitGS commented 4 months ago

Hi! I tried training a custom Named Entity Recognition model using spaCy, but despite multiple attempts, I keep getting a warning that some entities in the training data I created are misaligned.

import spacy
from spacy.training import Example
import random

nlp=spacy.load('en_core_web_sm')

training_data=[
("Hello from India", {"entities": [(11, 15, "GPE")]})
]

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*other_pipes)
optimizer=nlp.create_optimizer()

losses={}
for i in range(10): #10 is the epoch value
  random.shuffle(training_data)
  for text, annotation in training_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotation)
    nlp.update([example], sgd = optimizer, losses=losses)

This is the warning that gets generated:

Warning (from warnings module):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/spacy/training/iob_utils.py", line 141
    warnings.warn(
UserWarning: [W030] Some entities could not be aligned in the text "Hello from India" with entities "[(11, 15, 'GPE')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.

The entity "India" starts from index 11 and ends at 15, yet spaCy doesn't recognise that it's a token. Any help is appreciated.

svlandeg commented 4 months ago

Hi! As this is not a bug, I'm transferring this to the discussion forum and will follow up with you there.
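
For readers who land here with the same warning, a likely cause is the end offset. spaCy entity annotations are character offsets with an exclusive end, like Python slices, so (11, 15) covers "Indi" rather than "India", and that span does not line up with a token boundary. The sketch below uses only plain string slicing to show this; `find_entity_span` is a hypothetical helper written for this illustration, not part of spaCy.

```python
def find_entity_span(text, phrase):
    """Hypothetical helper: return (start, end) character offsets of the
    first occurrence of phrase in text, with an exclusive end offset."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return start, start + len(phrase)


text = "Hello from India"

# The offsets from the question stop one character short of the word.
reported = text[11:15]

# Deriving the span from the string itself gives an exclusive end of 16.
start, end = find_entity_span(text, "India")
corrected = text[start:end]

# Training data rebuilt with the derived offsets.
training_data = [(text, {"entities": [(start, end, "GPE")]})]

print(repr(reported))   # slice covered by the original annotation
print(repr(corrected))  # slice covered by the derived annotation
print(training_data)
```

With the end offset changed to 16, the check that the warning itself suggests, `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)`, can be used to confirm that no token is tagged '-'.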