Closed gustavengstrom closed 3 years ago
Saving the doc raises the warning:

[W109] Unable to save user hooks while serializing the doc. Re-add any required user hooks to the doc after processing.

Reloading the doc then means that lexical attributes like `token.is_punct` are not available. How would one go about re-adding the required user hooks to the doc? Example code:
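A minimal sketch of the round trip that triggers this, assuming (per the reply below) that the doc was reloaded with a bare `Vocab()`; the file name `test.spacy` and the sample text are placeholders:

```python
import spacy_stanza
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Process a text with the stanza pipeline, which adds vector user hooks.
nlp = spacy_stanza.load_pipeline("en")
doc = nlp("Hello, world!")

# Serializing the doc raises W109, since user hooks can't be saved.
doc.to_disk("test.spacy")

# Reloading into a blank Vocab loses the English lexeme defaults, so
# attributes like token.is_punct no longer come back as expected.
doc = Doc(Vocab()).from_disk("test.spacy")
```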
The user hooks are just for the vectors, if the stanza model provides pretrained word embeddings. They don't affect lexeme attributes like `is_punct`.

To get the lexeme attributes, you need to reload the doc with the vocab for the correct language instead of using `Vocab()`, which is blank and doesn't know anything about the English defaults. What you want instead:
```python
import spacy
from spacy.tokens import Doc

new_nlp = spacy.blank("en")
doc = Doc(new_nlp.vocab).from_disk('filename')
```
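As a quick sanity check (hypothetical output, assuming the reloaded doc contains some punctuation):

```python
# Lexeme attributes now resolve against the English defaults.
for token in doc:
    print(token.text, token.is_punct)
```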
In terms of the user hooks, there's a good chance you might not be using them at all? It's only the values you get for `token.vector` or `doc.vector` that would be affected.
Here's how they are added initially, if you need to redo this:
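The hooks themselves are plain entries in `doc.user_hooks` and `doc.user_token_hooks`, so re-adding them after reloading is just a matter of assigning functions again. A rough sketch using spaCy's documented user-hooks API, not spacy-stanza's actual implementation; the embedding table and vector dimension are stand-ins for wherever your vectors really come from:

```python
import numpy
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab).from_disk("test.spacy")

# Stand-in embedding table; replace with the stanza model's pretrained
# vectors (or any other source of word vectors).
DIM = 50
table = {"hello": numpy.ones(DIM, dtype="float32")}

def token_vector(token):
    # Zero vector for out-of-vocabulary tokens.
    return table.get(token.text.lower(), numpy.zeros(DIM, dtype="float32"))

def doc_vector(doc):
    # Simple doc-level vector: average over the token vectors.
    return numpy.mean([token_vector(t) for t in doc], axis=0)

# Re-add the hooks; token.vector and doc.vector will call these.
doc.user_token_hooks["vector"] = token_vector
doc.user_hooks["vector"] = doc_vector
```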
Worked! Thanks for the super speedy reply...
```python
import spacy_stanza
from spacy.tokens import Doc

nlp = spacy_stanza.load_pipeline("en")
doc = Doc(nlp.vocab).from_disk('test.spacy')
```
Glad to hear it's working! (Loading the whole stanza pipeline might be a bit slow if all you're trying to do is reload docs. The vocab is the same as the one you get from `spacy.blank("en")`, which is much faster.)