Open DomHudson opened 1 month ago
It's true that the docs shouldn't really lead with the to_bytes() example, since it's usually less useful than nlp.to_disk().
The different serialization functions do different things, for different contexts. The main thing to keep in mind is that initialization and deserialization of data are handled in different steps, so that you can do one without the other. The spacy.load() function does both: it uses the config to initialize the nlp object, and then loads in the data. The nlp.from_disk() and nlp.from_bytes() functions only load in data, trusting that you've set up the nlp object correctly beforehand. The nlp.to_bytes() and nlp.to_disk() functions give you the data that you can later load back in with from_bytes() or from_disk().
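To make the two steps concrete, here is a minimal sketch of the bytes round trip, assuming spaCy v3. It uses a blank English pipeline with a sentencizer as a stand-in so that no trained model has to be downloaded; that pipeline is an illustration, not something from the thread.

```python
import spacy
from spacy.util import get_lang_class

# Stand-in pipeline (assumption): blank English plus a rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Saving: the config describes how to build the object, the bytes hold its data.
config = nlp.config
data = nlp.to_bytes()

# Step 1, initialization: rebuild an empty pipeline of the same shape from the config.
lang_cls = get_lang_class(config["nlp"]["lang"])
restored = lang_cls.from_config(config)

# Step 2, deserialization: load the saved data into the object that was just built.
restored.from_bytes(data)

print(restored.pipe_names)
```

If step 1 is skipped or builds a different pipeline, from_bytes() has nothing suitable to load the data into, which is why the two pieces have to be stored together.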
Sometimes your model will need custom code in order to be loaded. For this you can make your model a Python package, and then spacy.load() can take an entry-point that will resolve to your package. This is what we do for the built-in models: there's a package called e.g. en_core_web_sm, and that's where it loads the model from.
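As a rough illustration of why custom code matters at load time, the sketch below (assuming spaCy v3) saves a pipeline containing a custom component and reloads it in the same process. The component name debug_token_count is hypothetical. Loading works here only because the decorator has already registered the component; in a fresh process that code would have to be imported first, which is the problem that packaging the model solves.

```python
import tempfile

import spacy
from spacy.language import Language


# Hypothetical custom component: records the token count on the Doc.
@Language.component("debug_token_count")
def debug_token_count(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("debug_token_count")

with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)
    # The saved config only stores the component's name. spacy.load() looks the
    # name up in the registry, so the function above must already be registered.
    reloaded = spacy.load(model_dir)

doc = reloaded("a b c")
print(doc.user_data["n_tokens"])
```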
Hi @honnibal,
Thank you for your response!
If I just want to persist and load a model from disk, is this code accurate?
import spacy
nlp = spacy.load('en_core_web_lg')
nlp.to_disk('/path/to/directory')
nlp = spacy.load('/path/to/directory')
Thank you!
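For reference, a self-contained sketch of that disk round trip, assuming spaCy v3. It swaps en_core_web_lg for a blank English pipeline and writes to a temporary directory so that it runs without any downloads; both substitutions are assumptions made for the example. It also shows the from_disk() variant, where the nlp object has to be set up by hand before the data is loaded.

```python
import tempfile

import spacy

# Stand-in for the trained pipeline (assumption): blank English plus a sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as model_dir:
    # Writes the config, meta, vocab, tokenizer and component data to the directory.
    nlp.to_disk(model_dir)

    # spacy.load() reads the saved config, builds the pipeline, then loads the data.
    reloaded = spacy.load(model_dir)

    # from_disk() only loads data, so the pipeline must be constructed beforehand.
    manual = spacy.blank("en")
    manual.add_pipe("sentencizer")
    manual.from_disk(model_dir)

doc = reloaded("Hello world. This is a test.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```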
Summary
This page says that to serialize a pipeline, you use the following methods:
and that you must take care of storing both and then loading them from disk.
However, it also appears that:
coupled with:
works and is a lot simpler. The code executes and I can successfully call the built nlp object on text.
Questions
Does this approach actually work identically?
nlp.config and to_bytes seem like implementation details rather than the API for serializing?
spacy.load, should this be added?
If this approach doesn't work, I think we should call this out and build a function/method that handles loading and saving to disk with a single call. This seems better than having to write your own disk persistence for the config and bytes object. What do you think?
Thanks!
Which page or section is this issue related to?
https://spacy.io/usage/saving-loading