explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Newest update breaks previously working annotations, throwing "Some entities could not be aligned" error #6071

Closed · mefrem closed this issue 4 years ago

mefrem commented 4 years ago

How to reproduce the behaviour

Hi team. I have a set of annotations with the starting and ending index positions for each entity relative to a document. They take the form:

{'entities': [(139, 152, 'TYPE'),
  (154, 167, 'TYPE'),
  (169, 175, 'TYPE'),
  (400, 410, 'TYPE'), etc.

These worked prior to the 2.3.2 update and no longer do. I suspect the tokenizer is acting differently when the documents these entities correspond to are passed to nlp.update([text], [annotations], sgd=optimizer, losses=losses), perhaps because of the introduction of a new whitespace-handling feature. Note that the records often contain runs of multiple whitespace characters and newlines (\n) in sequence.

This is the error thrown.

UserWarning: [W030] Some entities could not be aligned in the text "CHECK IN [DATE]

DATE VISITED..." with entities "[(139, 152, 'TYPE'), (154, 167, 'TYPE'), (16...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.

However when I manually check the index locations of those entities and the document, they match up.

What is causing the annotations to stop working?


adrianeboyd commented 4 years ago

Is this a difference between v2.3.1 and v2.3.2, or between v2.2 and v2.3? The warning itself is fairly new, but the overall behavior isn't: misaligned annotations were simply discarded silently in the past. You're right that there's some new handling of the alignment for character offsets in v2.3, so that could be related. (And it's a warning, not an error, so if you want, you can keep training despite a few misalignments. Obviously it's a good idea to take a closer look at the data, but even if you don't modify anything, training can proceed much as it did in v2.2.)
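If you do decide to train through it, the message is raised via Python's standard warnings module (that's how your traceback shows a UserWarning), so an ordinary warnings filter can silence just this one code. A minimal sketch in plain Python, not a spaCy-specific switch:

import warnings

# Hide only the W030 alignment warning; other spaCy warnings still appear.
# The `message` argument is a regex matched against the start of the message.
warnings.filterwarnings("ignore", message=r"\[W030\]")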

Can you include a full example that doesn't work as expected? If necessary, you can replace sensitive info with something like xxx to keep the character offsets, although keeping the punctuation as-is and the numbers as numbers (replacing with fake values like 123 is fine) would be helpful due to interactions with the tokenizer.

(In general we'd ask that you not tag maintainers in issues here, since it can make our mentions hard to manage. We already get notifications for all the issues, I promise!)

mefrem commented 4 years ago

(Pardon my tagging you! I removed it :)

I've included a mock-up text and annotation that reproduce the misalignment error, whereas the exact same annotation and text worked prior to the character-offset changes made in v2.3 (also, could I see what those changes were? I'm having trouble locating that documentation).

Note how the 'Vitamin C' entity throws the error whereas 'cat 250' does not. We suspect it has to do with how entities followed by a full stop (.) and/or newline breaks are tokenized.

import sys
import random

import spacy
from spacy.gold import biluo_tags_from_offsets

MODEL_NAME = "modelv1"

# Hyperparameters
N_ITERATIONS = 5
BATCH_SIZE = 1

training_data = []
count = 0

t = 'X\n\n\ncat 250.\n\nTHINGS:\n1.  Vitamin C.\n2.  Man\n3.  horse.\n\nEnd.'
annotations = {'entities':[
    (4, 11, 'THING'),
    (26, 35, 'THING'),
    (41, 44, 'THING'),
    (49, 54, 'THING')
]}
data_path = 'fake_text.txt'

training_data.append((t,annotations, data_path))

n_training_documents = len(training_data)
print("Training with " + str(n_training_documents) + " documents")

nlp = spacy.load("en_core_web_md")

disabled_pipes = nlp.disable_pipes('tagger', 'parser')

if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
else:
    ner = nlp.get_pipe('ner')

LABELS = ['THING']
for label in LABELS:
    ner.add_label(label)

#optimizer = nlp.begin_training()
optimizer = nlp.entity.create_optimizer()
losses = {}

iterations = 0
epochs = 0

while True:
    random.shuffle(training_data)

    batches = spacy.util.minibatch(training_data, BATCH_SIZE)

    for batch in batches:

        text_list = []
        annotations_list = []
        file_paths = []
        for text, annotations, filepath in batch:
            text_list.append(text)
            annotations_list.append(annotations)
            file_paths.append(filepath)
        try:
            # Update on the collected minibatch
            nlp.update(text_list, annotations_list, sgd=optimizer, losses=losses)
            iterations += 1
            if iterations % 100 == 0:
                print(str(iterations), "iterations completed")

            if iterations == N_ITERATIONS:
                break
        except Exception as err:
            print("Error: " + str(err) + ": " + str(file_paths))
            sys.exit()

    epochs += 1
    print("Epoch " + str(epochs) + " completed after " + str(iterations) + " iterations")

    if iterations == N_ITERATIONS:
        break

print("Iterations: " + str(iterations))
print("Epochs: " + str(epochs))

disabled_pipes.restore()

nlp.to_disk(MODEL_NAME)

print("Done updating spacy...")
############################################# Print statements for troubleshooting
print('raw annotations:')
for i in annotations['entities']:
    print(t[i[0]:i[1]])

print()
print('annotations inside training data passed to spacy.update')
for ent in training_data[0][1]['entities']:
    print(training_data[0][0][ent[0]:ent[1]])

print()
print('And this is the check suggested in the warning message, still showing the same misalignment:')
doc = nlp.make_doc(t)
tags = biluo_tags_from_offsets(doc, training_data[0][1]['entities'])
for i, token in enumerate(doc):
    print(f'token {i}:', token, tags[i])

adrianeboyd commented 4 years ago

Thanks for the example, that's very helpful to see! This is a case where you have to look carefully at the tokenization, especially around punctuation, which is often where things don't align like you'd expect. (I'm relieved that it's not actually a bug related to whitespace.)

spacy.gold.biluo_tags_from_offsets doesn't fix anything, it just shows you more information about what's going on. There's not really a good way to automatically fix cases like this since changing the entity boundaries might change the type of entity in some cases, so to be on the safe side, the default choice is to ignore misaligned entities.

import spacy
from spacy.gold import GoldParse

t = 'X\n\n\ncat 250.\n\nTHINGS:\n1.  Vitamin C.\n2.  Man\n3.  horse.\n\nEnd.'
annotations = {'entities':[
    (4, 11, 'THING'),
    (26, 35, 'THING'),
    (41, 44, 'THING'),
    (49, 54, 'THING')
]}

nlp = spacy.blank("en")
doc = nlp.make_doc(t)

gp = GoldParse(doc, **annotations) # shows a warning about misaligned entities

tokens = [token.text for token in doc]
# ['X', '\n\n\n', 'cat', '250', '.', '\n\n', 'THINGS', ':', '\n', '1', '.', ' ', 'Vitamin', 'C.', '\n', '2', '.', ' ', 'Man', '\n', '3', '.', ' ', 'horse', '.', '\n\n', 'End', '.']

biluo_tags = spacy.gold.biluo_tags_from_offsets(doc, annotations["entities"])
# ['O', 'O', 'B-THING', 'L-THING', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '-', '-', 'O', 'O', 'O', 'O', 'U-THING', 'O', 'O', 'O', 'O', 'U-THING', 'O', 'O', 'O', 'O']

for token, biluo_tag in zip(tokens, biluo_tags):
    if biluo_tag == "-":
        print(token)
# Vitamin
# C.

The two - items correspond to the misalignment, which is for Vitamin C. The problem is that C. isn't split into two tokens like in the other cases, so the end offset doesn't line up with the start or end of any token. The default tokenizer does this because C. is often an abbreviation, like a middle initial, that should stay one token. You either need to modify the tokenizer settings or adjust your annotation to get cases like this to align. We can't automatically decide whether Vitamin or Vitamin C. is the intended THING; it really depends on the annotation scheme.
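If adjusting the annotations is acceptable, one workaround in v2 is to snap misaligned offsets outward to the nearest token boundaries before training. This is only a sketch (the snapping logic here is ours, not a spaCy API), and it assumes that widening a span, e.g. from Vitamin C to Vitamin C., is fine under your annotation scheme:

import spacy

nlp = spacy.blank("en")
t = 'X\n\n\ncat 250.\n\nTHINGS:\n1.  Vitamin C.\n2.  Man\n3.  horse.\n\nEnd.'
entities = [(4, 11, 'THING'), (26, 35, 'THING'), (41, 44, 'THING'), (49, 54, 'THING')]

doc = nlp.make_doc(t)
token_starts = {token.idx for token in doc}
token_ends = {token.idx + len(token.text) for token in doc}

snapped = []
for start, end, label in entities:
    # Move each misaligned boundary outward to the nearest token edge
    new_start = max((s for s in token_starts if s <= start), default=start)
    new_end = min((e for e in token_ends if e >= end), default=end)
    snapped.append((new_start, new_end, label))

print(snapped)
# (26, 35) becomes (26, 36), i.e. 'Vitamin C.' including the trailing period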

In spaCy v3, there will be a new option for Doc.char_span that you could use to automatically adjust character offsets like this. You'll be able to specify whether you want it to be strict (as it is now), to snap to tokens completely covered by the span (giving a smaller span inside the original offsets if it's misaligned), or to snap to tokens at least partially covered (giving a longer span).
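As a rough sketch of how that could look, continuing from the snippet above and assuming the option ships as an alignment_mode keyword (the naming isn't final yet):

# spaCy v3 sketch; `alignment_mode` is the assumed name of the new option
doc = nlp.make_doc(t)
span = doc.char_span(26, 35, label="THING", alignment_mode="expand")
# snaps outward to partially covered tokens: 'Vitamin C.'
span = doc.char_span(26, 35, label="THING", alignment_mode="contract")
# snaps inward to fully covered tokens: 'Vitamin'
span = doc.char_span(26, 35, label="THING", alignment_mode="strict")
# returns None, because (26, 35) doesn't match token boundaries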

mefrem commented 4 years ago

That's a really cool feature for the future! Looking forward to it. Is there any indication of when it will be available?

Our actual problem stems from the fact that we have a set of annotations made by another application whose tokenizer we cannot inspect or reproduce. The character indexes provided by that application don't work with spacy's nlp.update(), which, as we see here, splits tokens differently, so the annotations' start and stop indexes end up misaligned with the tokens spacy creates.

Is there a way to force spacy to create tokens out of the character indexes provided in our annotations? For instance, if we know there is an entity at character indexes (26:35), can we tell spacy to tokenize the text in a way that produces a token or tokens matching that span?

ud2195 commented 4 years ago

Hi @mefrem, did you try making a custom tokenizer by adding your own rules to it, for example splitting on punctuation? These are the tokens I got for your sentence after using a custom tokenizer that splits on punctuation. I think it will solve your problem for the time being:

['X',
 '\n\n\n',
 'cat',
 '250',
 '.',
 '\n\n',
 'THINGS',
 ':',
 '\n',
 '1',
 '.',
 ' ',
 'Vitamin',
 'C',
 '.',
 '\n',
 '2',
 '.',
 ' ',
 'Man',
 '\n',
 '3',
 '.',
 ' ',
 'horse',
 '.',
 '\n\n',
 'End',
 '.']

Custom tokenizer code:

import re
import string

import spacy

# assumes `nlp` is the already-loaded pipeline you are training

# Add every punctuation character as an infix so tokens like 'C.' split apart
punctuations = list(string.punctuation)
infixes = nlp.Defaults.infixes
for x in punctuations:
    infixes = infixes + (re.escape(x),)

infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

adrianeboyd commented 4 years ago

It's possible to create docs with whatever tokens you want, but it's only going to be useful when you're training a model if you can implement a custom tokenizer for your pipeline that provides that same tokenization for new texts. If you've trained a model from data with a tokenization method you can't reproduce, it simply won't work well when you apply it to new texts that aren't tokenized in the same way.
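For completeness, here's a minimal sketch of building a Doc with predetermined token boundaries via the words/spaces constructor, using just a fragment of the example text with the entity offsets recomputed for that fragment (and assuming nlp is the pipeline from earlier in the thread):

from spacy.tokens import Doc
from spacy.gold import GoldParse

fragment = '1.  Vitamin C.\n'
# 'C' and '.' are separate tokens here; `spaces` marks a single trailing space
words  = ['1', '.', ' ', 'Vitamin', 'C', '.', '\n']
spaces = [False, True, False, True, False, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
assert doc.text == fragment  # the custom tokens reconstruct the original text

# 'Vitamin C' spans characters (4, 13) of the fragment and now aligns cleanly
gp = GoldParse(doc, entities=[(4, 13, 'THING')])  # no W030 warning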

mefrem commented 4 years ago

I believe I am content with closing this issue. The problem turned out not to be the character-offset changes between versions, but rather the previously silent warnings that the recent update made visible. The root cause is the difference in tokenization between the application that created the annotations and spacy, so it's not really a spacy issue.

Thank you!

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.