explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Trained model changes behaviour after being loaded from disc #3487

Closed simon-larsson closed 5 years ago

simon-larsson commented 5 years ago

I have trained a custom NER model. When I save it and load it from disc it makes different predictions than if I keep it in RAM.

Everything seems to be loaded correctly: it gets the pipeline and the correct entity. It is only when making predictions that it behaves like a totally untrained model.

Even stranger is that if I perform training on a loaded model it will start training from where the stored model was and will perform correct predictions again.

How to reproduce the behaviour

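import spacy
import random

# training_data is not shown here; from its use below it is a list of
# (text, annotations) tuples, where annotations['entities'] holds
# entries whose third item (entity[2]) is the label.

# nlp = spacy.blank('en')   # started with blank 'en'
nlp = spacy.load('./spacy_model')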

if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)

    for _, annotations in training_data:
        for entity in annotations.get('entities'):
            ner.add_label(entity[2])

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

with nlp.disable_pipes(*other_pipes):  # only train NER

    optimizer = nlp.begin_training()

    for i in range(20):
        print("Starting iteration " + str(i))

        random.shuffle(training_data)
        losses = {}

        for text, annotations in training_data:

            try:
                # Some samples give an exception from GoldParser for unknown reasons
                nlp.update(
                    [text],         # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,       # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            except Exception:
                # Skip the failing sample; a bare except would also swallow Ctrl+C
                pass

        print(losses)

# Store
nlp.to_disk('./spacy_model')

# Load 
loaded_nlp = spacy.load('./spacy_model')

sample_text = 'Simon Larsson likes to play with spaCy'

doc = nlp(sample_text)

print()
print('MODEL IN MEMORY')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

doc = loaded_nlp(sample_text)

print()
print('MODEL FROM DISC')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Which for me gives the output:

...

MODEL IN MEMORY
Simon Larsson likes to 0 22 Name

MODEL FROM DISC
Simon Larsson 0 13 Graduation Year
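
A minimal sketch of how the comparison above could be automated over several texts. It assumes only that each pipeline is callable and returns a doc with `.ents`, as in the script. The helper names `entity_tuples` and `compare_predictions` are made up for illustration and are not part of spaCy.

```python
def entity_tuples(doc):
    """Collect (text, start_char, end_char, label) for every entity in a doc."""
    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]


def compare_predictions(nlp_a, nlp_b, texts):
    """Return (text, ents_a, ents_b) for each text where the two pipelines disagree."""
    mismatches = []
    for text in texts:
        ents_a = entity_tuples(nlp_a(text))
        ents_b = entity_tuples(nlp_b(text))
        if ents_a != ents_b:
            mismatches.append((text, ents_a, ents_b))
    return mismatches
```

With the script above, `compare_predictions(nlp, loaded_nlp, [sample_text])` would return an empty list if saving and loading preserved the predictions, and the differing entities otherwise.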


honnibal commented 5 years ago

Hey,

This was a bad regression introduced in v2.1 that we fixed quickly after release. It looks like you happened to land on one of the intermediate versions. If you upgrade to v2.1.3 it should be resolved. Please let us know if it's not fixed in the new version; we definitely want to make sure this is resolved.

(See also #3433, #3458. Resolved by d9a07a7f6ee8c)
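
For anyone checking whether an installed version might be affected, a rough sketch follows. The range used here (from 2.1.0 up to, but not including, 2.1.3) is an assumption read off the comment above, not an official list of affected releases. `parse_version` and `in_affected_range` are hypothetical helpers, and any pre-release suffix in the version string is ignored.

```python
import re


def parse_version(version):
    """Return the leading major.minor.patch of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    # A missing patch component (e.g. "2.1") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def in_affected_range(version, first_bad=(2, 1, 0), first_fixed=(2, 1, 3)):
    """True if version is >= first_bad and < first_fixed (assumed bounds)."""
    return first_bad <= parse_version(version) < first_fixed
```

In a real environment this could be called as `in_affected_range(spacy.__version__)`; a `True` result would suggest upgrading before retraining or re-saving the model.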

simon-larsson commented 5 years ago

Can confirm that after upgrading to v2.1.3 the model works as expected when saved to and loaded from disk, with both pickle and the built-in functions. Thank you and sorry for the duplicate!

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.