Closed Zimovik007 closed 3 years ago
There is not enough information here for us to see what's going on, in particular how the models are loaded and saved. From the additional information in your similar SO question (https://stackoverflow.com/q/65087652) my first guess would be that you're not including the additional custom segmentation component when you load the model for evaluation? The NER model won't predict entities across sentence boundaries, so if the boundaries are not the same during your evaluation, it might affect the results?
It's also not clear what your custom tokenizer looks like from the code above. If it's just using custom settings for the built-in Tokenizer, then the settings are most likely saved and reloaded correctly, but otherwise, it depends on how the serialization is implemented. You can check whether token_acc in the scores differs between the dev set during training and your separate evaluation.
Be aware that you may run into some issues with the existing tagger and parser components because they may not work as well with the modified tokenizer, but if the differences are minor, it may not be a big deal in the end.
@adrianeboyd I added more code here.
token_acc during training and separate testing is the same: 100.0
As for the change in accuracy: several labels drop by 10-14%, some by 5-10%, and some do not lose accuracy at all.
Nothing jumps out at me, but this is still too piecemeal for us to be able to track down what might be going on. I still suspect you're not loading the exact same model that you saved during training when you load the model for evaluation. It won't work for your custom segmenter because you haven't implemented serialization, but for a pipeline with built-in components if you compare nlp.to_bytes() for the original model and the reloaded model they should be identical. You can also just compare the hashes of the serialized versions to have something shorter to inspect / compare:
assert hash(nlp.to_bytes()) == hash(nlp_reloaded.to_bytes())
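A minimal sketch of the same comparison using a cryptographic digest instead of the built-in hash(). Python salts hash() for bytes per process by default, so the built-in value is only comparable within a single run, whereas a SHA-256 digest can be printed, logged and compared across runs. The helper name digest and the stand-in byte strings are illustrative assumptions; in the thread's setting the inputs would be nlp.to_bytes() and nlp_reloaded.to_bytes().

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest that is stable across Python processes."""
    return hashlib.sha256(data).hexdigest()


# Stand-in byte strings (assumption): replace with nlp.to_bytes() and
# nlp_reloaded.to_bytes() when comparing real pipelines.
original = b"serialized pipeline"
reloaded = b"serialized pipeline"

# Identical serialized data gives identical digests.
assert digest(original) == digest(reloaded)
```

If the digests differ, the two serialized pipelines are not byte-identical, which points at the save/load step rather than at the evaluation code.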
If you can provide a minimal working script that we can run that shows this error, we would be able to look in more detail to see whether there might be a bug here. (Using dummy or anonymized data and saving/loading from the current working directory would be fine.)
@adrianeboyd so the hash doesn't really match
Right, so that means the models aren't entirely the same - so something must be going wrong with the IO.
If you can provide a minimal working script that we can run that shows this error, we would be able to look in more detail to see whether there might be a bug here.
We'll really need this minimal script to be able to investigate further - just one script that runs from start to finish and exhibits the error. Otherwise it's too difficult for us to help debug.
@svlandeg @adrianeboyd Sorry for the very long delay. It was a very busy end of the year. I prepared a jupyter notebook and a small dataset for you to show the strange behavior of the model after saving and loading. Here I have uploaded a .ipynb and dataset: https://github.com/Zimovik007/spacy_strange_scenario
Thanks for the working example, now it's a lot easier to see what's going on. I really thought it would just come down to sentence segmentation differences, but in the end I did find two bugs related to serialization when looking into the details. The second problem is the one that's concretely affecting the results for the example above.
Tokenizer settings
bug in url_match setting serialization
The url_match=None setting isn't preserved on reloading; it's replaced with the default url_match instead. This doesn't affect your results above because your tokenization is 100%, but it affects the to_bytes comparisons, and would have an effect if your texts contained URL-like tokens.
This is easy to fix, PR to come soon. As a workaround for <=2.3.5, you can set url_match to a regex that never matches. (I typically use something like re.compile("a^").match for this.)
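A small sketch of why that pattern works as a "never matches" placeholder: it requires a literal a followed by the start of the string, which cannot occur without the MULTILINE flag. The name never_match and the sample strings are illustrative assumptions. The resulting function would then be passed as the url_match setting of the tokenizer; that step is only described in a comment here because it needs a spaCy install.

```python
import re

# "a^" needs an "a" and then a start-of-string position, which is impossible
# without re.MULTILINE, so .match() returns None for every input.
never_match = re.compile("a^").match

# Sample strings (assumption), including URL-like text and the pattern itself.
for text in ["http://example.com", "www.example.com", "a", "a^", ""]:
    assert never_match(text) is None

# In spaCy 2.3.x this callable would be supplied as the tokenizer's url_match
# setting in place of None (not executed here).
```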
unicode unescaping (not a bug exactly, just confusing for to_bytes comparisons)
The tokenizer does some unicode unescaping on load (needed for python 2) that just affects the to_bytes comparison (not the actual regexes), so you have to save and reload twice for the to_bytes comparison to work as expected.
Weird NER component state after adding new labels to an existing model
There is something weird going on when you add labels to an existing model, where the component is in a different state before and after reloading even though all the serialized data is identical. After it's reloaded once, the state appears to stay stable. Before it's reloaded, it does not train as well, and you can sometimes see differences in the model predictions before and after the first reload as in your example. This is a bit tricky to reproduce, but you can see very noticeable differences in the losses while training and with particular data+settings, you can see differences in the model predictions.
For your use case, it doesn't make sense to extend the NER model in de_core_news_sm, since your data has unrelated (and to some extent conflicting) labels, plus your training data would need to include the existing labels to keep the model from forgetting them.
If you start with a new NER model, I don't think you'll see the weird behavior related to serialization because the labels are all added initially. If you do want to extend an existing NER model, you can work around the second issue by saving and reloading the model once after adding the labels. We'll see if we can figure out what's going on.
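The save-and-reload workaround could be sketched roughly as follows. The helper save_and_reload, its load_fn parameter and the directory name are hypothetical; with spaCy v2 the loader would be spacy.load, and a pipeline containing a custom component without serialization would need extra handling when it is loaded again.

```python
from pathlib import Path


def save_and_reload(nlp, load_fn, model_dir):
    """Write the pipeline to model_dir and return a freshly loaded copy."""
    model_dir = Path(model_dir)
    nlp.to_disk(model_dir)
    return load_fn(model_dir)


# Intended use with spaCy v2 (assumption, not executed here):
#   import spacy
#   ner = nlp.get_pipe("ner")
#   for label in new_labels:  # hypothetical list of label strings
#       ner.add_label(label)
#   nlp = save_and_reload(nlp, spacy.load, "tmp_model")
#   # ...then continue training with the reloaded nlp
```

Taking the loader as a parameter keeps the helper independent of how the pipeline is restored, so it can also be tried with a stand-in object.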
However, part of the problem in your scores comparison above is that you've disabled custom_seg and parser in the training loop, so if you run the evaluation within the loop, you'll get different results than outside the with nlp.disable_pipes() context because the NER model won't predict entities across sentence boundaries.
For more accurate training (so the NER model sees the same Doc state it would see in the real pipeline), it would be better to leave custom_seg enabled within the loop, too. You can also leave the parser enabled to train with realistic sentence boundaries, but this gets tricky because it will try to update the parser model, too. In theory, the parser should ignore training data with missing annotations, but I'm not 100% sure this works perfectly in practice. If you want to leave it enabled and stay on the safe side, one option is to just reset it to its original state at the end of every loop.
The changes would look something like this:
    import random
    import warnings

    import spacy
    from spacy.util import compounding, minibatch
    from tqdm import tqdm

    # train_data and logger are assumed to be defined earlier in your script.
    # Load the German model without its NER and add a fresh NER component.
    nlp = spacy.load("de_core_news_sm", disable=["ner"])
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    # Register every entity label that occurs in the training data.
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    optimizer = nlp.begin_training()
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # Only the components listed in pipe_exceptions stay enabled during training.
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        warnings.filterwarnings("once", category=UserWarning, module='spacy')
        sizes = compounding(1.0, 16.0, 1.001)
        for _ in tqdm(range(30)):
            random.shuffle(train_data)
            batches = minibatch(train_data, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            logger.info(f"Losses: {losses}")
You can probably also just reset the parser at the end of training (since reloading it more often will slow your training down), but I'm honestly not sure how much its performance changes as you train. I can see that the saved parser model isn't identical after a training iteration, but I'm not sure to what extent its predictions would change. Hopefully it's a very small difference, if any. (I may be worrying for no reason here.)
Edited: What I said about leaving components that set sentence boundaries enabled is incorrect: the predictions of previous components are not currently used in nlp.update, so this is not useful.
@adrianeboyd your advice helped me a lot, thanks. The accuracy doesn't drop anymore, but another question has come up: why does the sm model predict much more slowly than the lg model?
Hi @Zimovik007, it's not always easy/convenient for us to follow up on multiple issues within the same thread. I'll go ahead and close this one as the original issue is resolved. If you're still running into speed issues, feel free to open a new issue describing the context in more detail - including a minimal reproducible script, the versions of spaCy and the models you're using, and the results you're getting. Thanks!
python: 3.7.3 spacy: 2.3.4
After each training epoch, I run the model on independent data to check how well the model is performing. After that, I save the precision of all iterations to a file.
But after saving the model and loading it, I run it on the same data and get a much worse result. What could have gone wrong? If you need more code, I can show you what is needed
custom tokenizer:
custom_seg:
model load/save:
but after saving, I load model like this:
Then I call the evaluate function on the same data in the scorer_test.jsonl file and get results that are much lower than what I got at each training iteration. For some labels, the results are even lower than after the first training epoch.