explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

[W109] Unable to save user hooks while serializing the doc #71

Closed: gustavengstrom closed this issue 3 years ago

gustavengstrom commented 3 years ago

Saving the doc raises the warning: [W109] Unable to save user hooks while serializing the doc. Re-add any required user hooks to the doc after processing. After reloading the doc, lexical attributes like token.is_punct are no longer available.

Example code:

import spacy_stanza
import stanza
from spacy.tokens import Doc
from spacy.vocab import Vocab
stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")
doc = nlp('Testing serialization.')
built_lexical_attributes = [t.is_punct for t in doc]
doc.to_disk('test.spacy')

doc = Doc(Vocab()).from_disk('test.spacy')
loaded_lexical_attributes = [t.is_punct for t in doc]

assert built_lexical_attributes == loaded_lexical_attributes

How would one go about re-adding the required user hooks to the doc?

adrianeboyd commented 3 years ago

The user hooks are only used for the vectors, and only if the stanza model provides pretrained word embeddings. They don't affect the lexeme attributes like is_punct.

To get the lexeme attributes, you need to reload the doc with the vocab for the correct language, instead of using Vocab(), which is blank and doesn't know anything about the English defaults. What you want instead:

import spacy
from spacy.tokens import Doc
new_nlp = spacy.blank("en")
doc = Doc(new_nlp.vocab).from_disk('filename')
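
A minimal sketch of the difference, assuming only spaCy is installed (no stanza needed). It uses a blank English pipeline in place of the stanza one and serializes with to_bytes/from_bytes instead of to_disk/from_disk to avoid touching the filesystem. The variable names are illustrative only.

```python
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Stand-in for the stanza pipeline: a blank English pipeline (an assumption
# made so the sketch runs without downloading any model).
nlp = spacy.blank("en")
doc = nlp("Testing serialization.")
original_flags = [t.is_punct for t in doc]

# Serialize in memory; to_disk/from_disk behave analogously.
data = doc.to_bytes()

# Reloading with an empty Vocab(): the vocab has no English lexical
# attribute getters, so per the thread flags like is_punct are not restored.
blank_doc = Doc(Vocab()).from_bytes(data)
blank_flags = [t.is_punct for t in blank_doc]

# Reloading with the vocab of a blank "en" pipeline: the English defaults
# are available, so the lexeme attributes are computed again.
reloaded = Doc(spacy.blank("en").vocab).from_bytes(data)
restored_flags = [t.is_punct for t in reloaded]

print("original:", original_flags)
print("Vocab():", blank_flags)
print("blank en:", restored_flags)
```

Comparing the three printed lists shows which reload path keeps is_punct in sync with the original doc.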

As for the user hooks, there's a good chance you aren't using them at all. Only the values you get for token.vector or doc.vector would be affected.

Here's how they are added initially, if you need to redo this:

https://github.com/explosion/spacy-stanza/blob/a87c723a1f45b456c3488e13afb0090362016bf9/spacy_stanza/tokenizer.py#L179-L181
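
If you do need vectors after reloading, re-adding hooks could look roughly like the sketch below. It relies on spaCy's doc.user_token_hooks dictionary, where the "vector" and "has_vector" keys are looked up by Token.vector and Token.has_vector. The two hook functions are hypothetical placeholders, not the spacy-stanza implementation. In practice you would return the stanza embeddings for the token there.

```python
import numpy
import spacy

nlp = spacy.blank("en")
doc = nlp("Testing serialization.")


def token_vector(token):
    # Hypothetical placeholder: a real hook would look up the pretrained
    # embedding for token.text. Here every token gets a dummy 4-d vector
    # filled with its character length.
    return numpy.full((4,), float(len(token.text)), dtype="float32")


def token_has_vector(token):
    # Hypothetical placeholder: claim that every token has a vector.
    return True


# User hooks are not serialized with the doc (hence warning W109), so they
# have to be attached again on every reloaded Doc.
doc.user_token_hooks["vector"] = token_vector
doc.user_token_hooks["has_vector"] = token_has_vector

print(doc[0].vector, doc[0].has_vector)
```

The same two assignments can be applied to a Doc that was just restored with from_disk or from_bytes.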

gustavengstrom commented 3 years ago

Worked! Thanks for the super speedy reply...

nlp = spacy_stanza.load_pipeline("en")
doc = Doc(nlp.vocab).from_disk('test.spacy')

adrianeboyd commented 3 years ago

Glad to hear it's working! (Loading the whole stanza pipeline might be a bit slow if all you're trying to do is reload docs. spacy.blank("en") gives you the same "en" language defaults and is much faster.)