explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

[Documentation] Serializing Pipeline unclear #13642

Open DomHudson opened 1 month ago

DomHudson commented 1 month ago

Summary

This page says that, to serialize a pipeline, you use the following calls:

config = nlp.config
bytes_data = nlp.to_bytes()

and that you must take care of storing both yourself and then loading them back from disk.

However, it also appears that:

nlp.to_disk('directory_name')

coupled with:

spacy.load('directory_name')

works, and this is a lot simpler. The code executes and I can successfully call the loaded nlp object on text.

Questions

  1. Does this approach actually work identically?

    1. If so, can we update the documentation? The nlp.config and to_bytes seem like implementation details rather than the API for serializing?
    2. I didn't see a mention on this page that you can load the persisted pipeline from disk with spacy.load. Should this be added?
  2. If this approach doesn't work, I think we should call this out and build a function/method that handles loading and saving to disk with a single call - this seems better than having to write your own disk persistence for the config and bytes object. What do you think?

Thanks!

Which page or section is this issue related to?

https://spacy.io/usage/saving-loading

honnibal commented 1 month ago

It's true that the docs shouldn't really lead with the to_bytes() example, since it's usually less useful than nlp.to_disk().

The different serialization functions do different things, for different contexts. The main thing to keep in mind is that initialization and deserialization of data are handled in separate steps, so that you can do one without the other. The spacy.load() function does both: it uses the config to initialize the nlp object, and then loads in the data. The nlp.from_disk() and nlp.from_bytes() functions only load in data, trusting that you've set up the nlp object correctly beforehand. The nlp.to_bytes() and nlp.to_disk() functions give you the data that you could later load in with from_{disk/bytes}.
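To make the two-step distinction concrete, here is a minimal sketch that restores the same pipeline both ways. It assumes spaCy v3 and uses a blank English pipeline with a sentencizer so that no trained model needs to be downloaded; the variable names are illustrative only.

```python
import tempfile

import spacy
from spacy.util import get_lang_class

# A small pipeline that needs no downloaded model (assumption for this sketch).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Route 1: to_disk() + spacy.load(). spacy.load() reads the saved config,
# initializes the nlp object, and then loads the data in one call.
with tempfile.TemporaryDirectory() as path:
    nlp.to_disk(path)
    nlp_from_disk = spacy.load(path)

# Route 2: config + to_bytes(). Here you do the initialization step yourself
# before from_bytes() loads the data into the prepared object.
config = nlp.config
bytes_data = nlp.to_bytes()
lang_cls = get_lang_class(config["nlp"]["lang"])
nlp_from_bytes = lang_cls.from_config(config)
nlp_from_bytes.from_bytes(bytes_data)

# Both restored pipelines should expose the same components.
print(nlp_from_disk.pipe_names, nlp_from_bytes.pipe_names)
```

The second route is mainly useful when the data has to travel as bytes (for example over a network or into a database) rather than as a directory on disk.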

Sometimes your model will need custom code in order to be loaded. For this you can make your model a Python package, and then spacy.load() can take an entry-point that will resolve to your package. This is what we do for the built-in models: there's a package called e.g. en_core_web_sm, and that's where it loads the model from.
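As a rough illustration of the custom-code point, the sketch below saves and reloads a pipeline containing a custom component. The component name token_counter and its behaviour are hypothetical, and spaCy v3 is assumed. Loading works here only because the registering decorator has already run in the same process; packaging the pipeline is one way to make sure that code gets imported wherever the pipeline is loaded.

```python
import tempfile

import spacy
from spacy.language import Language


# Hypothetical custom component: records the token count on the Doc.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter")

with tempfile.TemporaryDirectory() as path:
    nlp.to_disk(path)
    # The saved config refers to the component by its registered name, so
    # that name must be registered before spacy.load() is called.
    reloaded = spacy.load(path)

doc = reloaded("Custom code must be importable")
print(reloaded.pipe_names, doc.user_data["n_tokens"])
```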

DomHudson commented 1 month ago

Hi @honnibal ,

Thank you for your response!

If I just want to persist and load a model from disk, is this code accurate?

import spacy
nlp = spacy.load('en_core_web_lg')

nlp.to_disk('/path/to/directory')
nlp = spacy.load('/path/to/directory')

Thank you!