explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

accuracy drops after saving and loading the model. NER task #6525

Closed Zimovik007 closed 3 years ago

Zimovik007 commented 3 years ago

python: 3.7.3 spacy: 2.3.4

After each training epoch, I run the model on independent data to check how well the model is performing. After that, I save the scores from all iterations to a file.

But after saving the model and loading it, I run it on the same data and get a much worse result. What could have gone wrong? If you need more code, I can show whatever is needed.

# Imports used by evaluate() and main() below; project-local helpers
# (logger, SPACY, Model, Train, is_need_doctype, create_custom_tokenizer)
# are defined elsewhere in the project.
import random
import time
import warnings

import click
import jsonlines
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer
from spacy.util import compounding, minibatch
from tqdm import tqdm


def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        # score the model's predictions against the gold entity annotations
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

def main(input_file=None, n_iter=300):

    scores = []

    with jsonlines.open('scorer_test.jsonl') as f:
        scorer_test = [line for line in f]

    scorer_test = [[x['text'], x['labels']] for x in scorer_test if is_need_doctype(x['text'], x['labels'])]

    if input_file and click.confirm('Do you want to update Training Data?', default=True):
        logger.info('Updating Training Data...')
        SPACY.NER_TRAINING_FILE.write_text(input_file.read_text())

    logger.info('Reading Training Data')
    train_data = Train.load_doccano_data(SPACY.NER_TRAINING_FILE)

    spacy.prefer_gpu()

    nlp = Model.load()

    nlp.tokenizer = create_custom_tokenizer(nlp)

    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    else:
        ner = nlp.get_pipe("ner")

    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    optimizer = nlp.resume_training()
    move_names = list(ner.move_names)

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')
        sizes = compounding(1.0, 16.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for _ in tqdm(range(n_iter)):
            random.shuffle(train_data)
            batches = minibatch(train_data, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            logger.info(f"Losses: {losses}")

            scores.append(evaluate(nlp, scorer_test))

    with jsonlines.open('scores-' + str(int(time.time())) + '.jsonl', mode='w') as writer:
        writer.write_all(scores)

    if click.confirm('Do you want to update Model?', default=True):
        Model.save(nlp, optimizer)

custom tokenizer:

import spacy
from spacy.tokenizer import Tokenizer


def create_custom_tokenizer(nlp):
    # add a dd.mm.yyyy date pattern to the prefixes and extra infix splits
    # on '.', ':', '(' and ')'
    prefix_re = spacy.util.compile_prefix_regex(tuple([r'\d{2}\.\d{2}\.\d{4}'] + list(nlp.Defaults.prefixes)))
    infix_re = spacy.util.compile_infix_regex(tuple([r'(\.)', r'(:)', r'(\()', r'(\))'] + list(nlp.Defaults.infixes)))
    suffixes = list(nlp.Defaults.suffixes)
    suffixes.remove(r'\.\.+')
    suffixes.append(r'\.\.\.+')
    suffix_re = spacy.util.compile_suffix_regex(tuple([r'-'] + suffixes))
    return Tokenizer(nlp.vocab, nlp.Defaults.tokenizer_exceptions,
                     prefix_search=prefix_re.search,
                     infix_finditer=infix_re.finditer,
                     suffix_search=suffix_re.search,
                     token_match=None)

custom_seg:

import re

boundary = re.compile('^[0-9]$')


def custom_seg(doc):
    # only allow a new sentence to start after punctuation, and never after an
    # enumeration like "1." or after ':', ';', ',', '/', '*'
    prev = doc[0].text
    length = len(doc)
    for index, token in enumerate(doc):
        is_number = token.text == '.' and boundary.match(prev) and index != (length - 1)
        if is_number or token.text in [':', ';', ',', '/', '*'] or not token.is_punct:
            # mark the following tokens (skipping whitespace) as not starting
            # a new sentence
            next_t = index + 1
            while next_t < length:
                doc[next_t].sent_start = False
                if doc[next_t].is_space:
                    next_t += 1
                else:
                    break
        prev = token.text
    return doc

model load/save:

from datetime import datetime

# CUSTOM_SEG and SPACY.MODEL_PATH are project-level constants


class Model:
    @classmethod
    def load(cls):
        logger.info('Loading model...')
        nlp = spacy.load('de_core_news_lg')
        if CUSTOM_SEG in nlp.pipe_names:
            nlp.remove_pipe(CUSTOM_SEG)
        nlp.add_pipe(custom_seg, name=CUSTOM_SEG, before='parser')
        logger.info(f'Successfully loaded {cls.get_meta(nlp)}')
        return nlp

    @classmethod
    def save(cls, nlp, optimizer=None):
        logger.info('Saving model...')
        nlp.meta['name'] = 'Registration Docs Parser'
        nlp.meta['version'] = datetime.now().strftime('%y.%m.%d %H:%M:%S')
        nlp.remove_pipe(CUSTOM_SEG)
        with nlp.use_params(optimizer.averages):
            nlp.to_disk(SPACY.MODEL_PATH)
        logger.info(f'Successfully saved {cls.get_meta(nlp)}')

    @staticmethod
    def get_meta(nlp):
        return f'{nlp.meta["name"]} ({nlp.meta["version"]})'

But after saving, I load the model like this:

def get_model():
    nlp = spacy.load('data/model')
    if 'custom_seg' in nlp.pipe_names:
        nlp.remove_pipe('custom_seg')
    nlp.add_pipe(custom_seg, name='custom_seg', before='parser')
    return nlp

Then I call the evaluate function on the same data in the scorer_test.jsonl file and get results that are much lower than what I got at each training iteration. For some labels, the results are even lower than after the first training epoch.

adrianeboyd commented 3 years ago

There is not enough information here for us to see what's going on, in particular how the models are loaded and saved. From the additional information in your similar SO question (https://stackoverflow.com/q/65087652), my first guess would be that you're not including the additional custom segmentation component when you load the model for evaluation. The NER model won't predict entities across sentence boundaries, so if the boundaries are not the same during your evaluation, that could affect the results.
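
A quick way to check that is to compare the sentence boundaries the two pipelines produce for the same text, for example with a rough sketch like this (Model.load() and the data/model path are taken from the snippets above, and the text is just any example from the evaluation data):

nlp_train = Model.load()               # pipeline as used during training (custom_seg before parser)
nlp_eval = spacy.load('data/model')    # reloaded pipeline, custom_seg not re-added
text = scorer_test[0][0]               # any text from the evaluation data
print([s.text for s in nlp_train(text).sents] == [s.text for s in nlp_eval(text).sents])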

It's also not clear what your custom tokenizer looks like from the code above. If it's just using custom settings for the built-in Tokenizer, then the settings are most likely saved and reloaded correctly, but otherwise it depends on how the serialization is implemented. You can check whether token_acc in the scores is different for the dev set during training versus when the model is used separately in your evaluation.
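
For example (a minimal sketch, reusing the evaluate() function, the get_model() loader and the scorer_test data from the code in this issue):

scores_in_loop = evaluate(nlp, scorer_test)           # during training
scores_reloaded = evaluate(get_model(), scorer_test)  # after saving and reloading
print(scores_in_loop['token_acc'], scores_reloaded['token_acc'])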

Be aware that you may run into some issues with the existing tagger and parser components because they may not work as well with the modified tokenizer, but if the differences are minor, it may not be a big deal in the end.

Zimovik007 commented 3 years ago

@adrianeboyd I added more code here. token_acc during training and separate testing is the same: 100.0

Zimovik007 commented 3 years ago

Regarding the change in accuracy: there are several labels whose accuracy drops by 10-14%, some that drop by 5-10%, and some labels that do not lose accuracy at all.

adrianeboyd commented 3 years ago

Nothing jumps out at me, but this is still too piecemeal for us to be able to track down what might be going on. I still suspect you're not loading the exact same model that you saved during training when you load the model for evaluation. It won't work for your custom segmenter because you haven't implemented serialization, but for a pipeline with built-in components, if you compare nlp.to_bytes() for the original model and the reloaded model, they should be identical. You can also just compare the hashes of the serialized versions to have something shorter to inspect / compare:

assert hash(nlp.to_bytes()) == hash(nlp_reloaded.to_bytes())

If you can provide a minimal working script that we can run that shows this error, we would be able to look in more detail to see whether there might be a bug here. (Using dummy or anonymized data and saving/loading from the current working directory would be fine.)
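
A self-contained round trip for that comparison would look roughly like this (sketch only; it assumes the non-serializable custom_seg component has been removed before saving, as in Model.save above, and "./model_check" is just an example path):

nlp.to_disk("./model_check")
nlp_reloaded = spacy.load("./model_check")
assert hash(nlp.to_bytes()) == hash(nlp_reloaded.to_bytes())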

Zimovik007 commented 3 years ago

@adrianeboyd so the hashes don't actually match

svlandeg commented 3 years ago

Right, so that means the models aren't entirely the same - so something must be going wrong with the IO.

If you can provide a minimal working script that we can run that shows this error, we would be able to look in more detail to see whether there might be a bug here.

We'll really need this minimal script to be able to investigate further - just one script that runs from start to finish and exhibits the error. Otherwise it's too difficult for us to help debug.

Zimovik007 commented 3 years ago

@svlandeg @adrianeboyd Sorry for the very long delay. It was a very busy end of the year. I prepared a jupyter notebook and a small dataset for you to show the strange behavior of the model after saving and loading. Here I have uploaded a .ipynb and dataset: https://github.com/Zimovik007/spacy_strange_scenario

adrianeboyd commented 3 years ago

Thanks for the working example, now it's a lot easier to see what's going on. I really thought it would just come down to sentence segmentation differences, but in the end I did find two bugs related to serialization when looking into the details. The second problem is the one that's concretely affecting the results for the example above.

  1. Tokenizer settings

    • bug in url_match setting serialization

      The url_match=None setting isn't preserved on reloading, it's replaced with the default url_match instead. This doesn't affect your results above because your tokenization is 100%, but it affects the to_bytes comparisons, and would have an effect if your texts contained URL-like tokens.

      This is easy to fix, PR to come soon. As a workaround for <=2.3.5, you can set url_match to a regex that never matches; I typically use something like re.compile("a^").match for this (see the sketch after this list).

    • unicode unescaping (not a bug exactly, just confusing for to_bytes comparisons)

      The tokenizer does some unicode unescaping on load (needed for Python 2) that just affects the to_bytes comparison (not the actual regexes), so you have to save and reload twice for the to_bytes comparison to work as expected (also shown in the sketch after this list).

  2. Weird NER component state after adding new labels to an existing model

    There is something weird going on when you add labels to an existing model, where the component is in a different state before and after reloading even though all the serialized data is identical. After it's reloaded once, the state appears to stay stable. Before it's reloaded, it does not train as well, and you can sometimes see differences in the model predictions before and after the first reload as in your example. This is a bit tricky to reproduce, but you can see very noticeable differences in the losses while training and with particular data+settings, you can see differences in the model predictions.
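
The sketch below illustrates both tokenizer-related points (illustrative only: the "./model_a" / "./model_b" paths are made up, and url_match is available as a Tokenizer argument from spaCy 2.3.0 onwards):

import re
import spacy

# 1) Workaround for the url_match serialization bug in <=2.3.5: use a pattern
#    that can never match instead of None, so nothing changes on reload; e.g.
#    add url_match=re.compile("a^").match to the Tokenizer(...) call in
#    create_custom_tokenizer above.
never_match = re.compile("a^").match

# 2) Because of the unicode unescaping on load, compare to_bytes() between two
#    reloaded copies rather than against the in-memory original.
nlp = spacy.load("de_core_news_lg")
nlp.to_disk("./model_a")
nlp_a = spacy.load("./model_a")
nlp_a.to_disk("./model_b")
nlp_b = spacy.load("./model_b")
assert hash(nlp_a.to_bytes()) == hash(nlp_b.to_bytes())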


For your use case, it doesn't make sense to extend the NER model in de_core_news_sm, since your data has unrelated (and to some extent conflicting) labels, plus your training data would need to include the existing labels to keep the model from forgetting them.

If you start with a new NER model, I don't think you'll see the weird behavior related to serialization because the labels are all added initially. If you do want to extend an existing NER model, you can work around the second issue by saving and reloading the model once after adding the labels. We'll see if we can figure out what's going on.
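
A rough sketch of that save-and-reload workaround (paths are illustrative; as in Model.save / Model.load above, the non-serializable custom_seg component has to be removed before saving and re-added after loading):

for _, annotations in train_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# save and reload once so the NER component is in the same state it will be in
# after the final save, then continue training with the reloaded pipeline
nlp.to_disk("./model_with_labels")
nlp = spacy.load("./model_with_labels")
ner = nlp.get_pipe("ner")
optimizer = nlp.resume_training()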

However, part of the problem in your scores comparison above is that you've disabled custom_seg and parser in the training loop, so if you run the evaluation within the loop, you'll get different results than outside the with nlp.disable_pipes() context because the NER model won't predict entities across sentence boundaries.

For more accurate training (so the NER model sees the same Doc state it would see in the real pipeline), it would be better to leave custom_seg enabled within the loop, too. You can also leave the parser enabled to train with realistic sentence boundaries, but this gets tricky because it will try to update the parser model, too. In theory, the parser should ignore training data with missing annotations, but I'm not 100% sure this works perfectly in practice. If you want to leave it enabled and stay on the safe side, one option is to just reset it to its original state at the end of every loop.

The changes would look something like this:

nlp = spacy.load("de_core_news_sm", disable=["ner"])
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in train_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
optimizer = nlp.begin_training()

pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
    warnings.filterwarnings("once", category=UserWarning, module='spacy')
    sizes = compounding(1.0, 16.0, 1.001)

    for _ in tqdm(range(30)):
        random.shuffle(train_data)
        batches = minibatch(train_data, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        logger.info(f"Losses: {losses}")

You can probably also just reset the parser at the end of training (since reloading it more often will slow your training down), but I'm honestly not sure how much its performance changes as you train. I can see that the saved parser model isn't identical after a training iteration, but I'm not sure to what extent its predictions would change; hopefully it's a very small difference, if any. (I may be worrying for no reason here.)
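
A minimal sketch of that reset, using the component's to_bytes() / from_bytes() methods (whether it's needed after every loop or only once at the end of training is the open question above):

# snapshot the parser weights before the training loop ...
parser_bytes = nlp.get_pipe("parser").to_bytes()

# ... and restore them afterwards so its predictions match the original model
nlp.get_pipe("parser").from_bytes(parser_bytes)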

Edited: What I said about leaving components that set sentence boundaries enabled is incorrect: the predictions of previous components are not currently used in nlp.update so this is not useful.

Zimovik007 commented 3 years ago

@adrianeboyd your advice helped me a lot, thanks. The accuracy doesn't drop anymore now, but another question came up: why does the sm model predict much more slowly than the lg model?

svlandeg commented 3 years ago

Hi @Zimovik007, it's not always easy/convenient for us to follow up on multiple issues within the same thread. I'll go ahead and close this one as the original issue is resolved. If you're still running into speed issues, feel free to open a new issue describing the context in more detail - including a minimal reproducible script, the versions of spaCy and the models you're using, and the results you're getting. Thanks!

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.