explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

deserialization doesn't behave as I expect #7079

Closed stauntonjr closed 3 years ago

stauntonjr commented 3 years ago

How to reproduce the behaviour

import spacy

def serialize(nlp):
    bytes_data = nlp.to_bytes()
    lang = nlp.config["nlp"]["lang"]  
    pipeline = nlp.config["nlp"]["pipeline"]
    return bytes_data, lang, pipeline

def deserialize(bytes_data, lang, pipeline):
    nlp = spacy.blank(lang)
    for pipe_name in pipeline:
        nlp.add_pipe(pipe_name)
    nlp.from_bytes(bytes_data)
    return nlp

nlp0 = spacy.load('en_core_web_sm')
bytes_data, lang, pipeline = serialize(nlp0)
nlp1 = deserialize(bytes_data, lang, pipeline)

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
     17 nlp0 = spacy.load('en_core_web_sm')
     18 bytes_data, lang, pipeline = serialize(nlp0)
---> 19 nlp1 = deserialize(bytes_data, lang, pipeline)

<ipython-input> in deserialize(bytes_data, lang, pipeline)
     12     for pipe_name in pipeline:
     13         nlp.add_pipe(pipe_name)
---> 14     nlp.from_bytes(bytes_data)
     15     return nlp
     16

~/tkidk/venv/lib/python3.8/site-packages/spacy/language.py in from_bytes(self, bytes_data, exclude)
   1911             b, exclude=["vocab"]
   1912         )
-> 1913         util.from_bytes(bytes_data, deserializers, exclude)
   1914         self._link_components()
   1915         return self

~/tkidk/venv/lib/python3.8/site-packages/spacy/util.py in from_bytes(bytes_data, setters, exclude)
   1122     exclude: Iterable[str],
   1123 ) -> None:
-> 1124     return from_dict(srsly.msgpack_loads(bytes_data), setters, exclude)
   1125
   1126

~/tkidk/venv/lib/python3.8/site-packages/spacy/util.py in from_dict(msg, setters, exclude)
   1144     # Split to support file names like meta.json
   1145     if key.split(".")[0] not in exclude and key in msg:
-> 1146         setter(msg[key])
   1147     return msg
   1148

~/tkidk/venv/lib/python3.8/site-packages/spacy/language.py in <lambda>(b, proc)
   1908         if not hasattr(proc, "from_bytes"):
   1909             continue
-> 1910         deserializers[name] = lambda b, proc=proc: proc.from_bytes(
   1911             b, exclude=["vocab"]
   1912         )

~/tkidk/venv/lib/python3.8/site-packages/spacy/pipeline/trainable_pipe.pyx in spacy.pipeline.trainable_pipe.TrainablePipe.from_bytes()

~/tkidk/venv/lib/python3.8/site-packages/spacy/util.py in from_bytes(bytes_data, setters, exclude)
   1122     exclude: Iterable[str],
   1123 ) -> None:
-> 1124     return from_dict(srsly.msgpack_loads(bytes_data), setters, exclude)
   1125
   1126

~/tkidk/venv/lib/python3.8/site-packages/spacy/util.py in from_dict(msg, setters, exclude)
   1144     # Split to support file names like meta.json
   1145     if key.split(".")[0] not in exclude and key in msg:
-> 1146         setter(msg[key])
   1147     return msg
   1148

~/tkidk/venv/lib/python3.8/site-packages/spacy/pipeline/trainable_pipe.pyx in spacy.pipeline.trainable_pipe.TrainablePipe.from_bytes.load_model()

~/tkidk/venv/lib/python3.8/site-packages/thinc/model.py in from_bytes(self, bytes_data)
    524         msg = srsly.msgpack_loads(bytes_data)
    525         msg = convert_recursive(is_xp_array, self.ops.asarray, msg)
--> 526         return self.from_dict(msg)
    527
    528     def from_disk(self, path: Union[Path, str]) -> "Model":

~/tkidk/venv/lib/python3.8/site-packages/thinc/model.py in from_dict(self, msg)
    547             for dim, value in info["dims"].items():
    548                 if value is not None:
--> 549                     node.set_dim(dim, value)
    550             for ref, ref_index in info["refs"].items():
    551                 if ref_index is None:

~/tkidk/venv/lib/python3.8/site-packages/thinc/model.py in set_dim(self, name, value)
    186         if old_value is not None and old_value != value:
    187             err = f"Attempt to change dimension '{name}' for model '{self.name}' from {old_value} to {value}"
--> 188             raise ValueError(err)
    189         self._dims[name] = value
    190

ValueError: Attempt to change dimension 'nV' for model 'hashembed' from 2000 to 5000

Your Environment

python -m spacy info

============================== Info about spaCy ==============================

spaCy version    3.0.3
Location         /Users/jrs/tkidk/venv/lib/python3.8/site-packages/spacy
Platform         macOS-10.16-x86_64-i386-64bit
Python version   3.8.3
Pipelines        en_core_web_sm (3.0.0)

Package Version Location


appdirs 1.4.4
appnope 0.1.2
argon2-cffi 20.1.0
arrow 0.17.0
async-generator 1.10
atomicwrites 1.4.0
atpublic 2.1.2
attrs 20.3.0
backcall 0.2.0
binaryornot 0.4.4
bleach 3.3.0
blis 0.7.4
boto3 1.17.7
botocore 1.20.7
cachetools 4.2.1
catalogue 2.0.1
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
click 7.1.2
cloudpickle 1.6.0
colorama 0.4.4
commonmark 0.9.1
configobj 5.0.6
cookiecutter 1.7.2
cycler 0.10.0
cymem 2.0.5
cytoolz 0.11.0
decorator 4.4.2
defusedxml 0.6.0
dictdiffer 0.8.1
distlib 0.3.1
distro 1.5.0
docutils 0.16
dpath 2.0.1
dulwich 0.20.19
dvc 1.11.15
en-core-web-sm 3.0.0
entrypoints 0.3
filelock 3.0.12
Flask 1.1.2
flatten-dict 0.3.0
flufl.lock 3.2
ftfy 5.9
funcy 1.15
future 0.18.2
gitdb 4.0.5
GitPython 3.1.13
grandalf 0.6
gunicorn 20.0.4
honcho 1.0.1
idna 2.10
importlib-metadata 3.4.0
ipykernel 5.4.3
ipython 7.20.0
ipython-genutils 0.2.0
ipywidgets 7.6.3
itsdangerous 1.1.0
jedi 0.18.0
jellyfish 0.8.2
Jinja2 2.11.3
jinja2-time 0.2.0
jmespath 0.10.0
joblib 1.0.1
jsonpath-ng 1.5.2
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.11
jupyter-console 6.2.0
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
kiwisolver 1.3.1
mailchecker 4.0.3
MarkupSafe 1.1.1
matplotlib 3.3.4
mistune 0.8.4
more-itertools 8.7.0
murmurhash 1.0.5
nanotime 0.5.2
nbclient 0.5.2
nbconvert 6.0.7
nbformat 5.1.2
nest-asyncio 1.5.1
networkx 2.5
notebook 6.2.0
numpy 1.20.1
packaging 20.9
pandas 1.2.2
pandocfilters 1.4.3
parso 0.8.1
pathlib2 2.3.5
pathspec 0.8.1
pathy 0.3.6
pexpect 4.8.0
phonenumbers 8.12.18
pickleshare 0.7.5
Pillow 8.1.0
pip 21.0.1
pkginfo 1.7.0
plac 1.1.3
plotly 4.14.3
pluggy 0.13.1
ply 3.11
poyo 0.5.0
preshed 3.0.5
prometheus-client 0.9.0
prompt-toolkit 3.0.16
ptyprocess 0.7.0
py 1.10.0
pyasn1 0.4.8
pycparser 2.20
pydantic 1.7.3
pydot 1.4.1
pyemd 0.5.1
Pygments 2.8.0
pygtrie 2.3.2
pyparsing 2.4.7
Pyphen 0.10.0
pyrsistent 0.17.3
pytest 4.6.5
pytest-runner 5.1
python-benedict 0.23.2
python-dateutil 2.8.1
python-dotenv 0.15.0
python-fsutil 0.4.0
python-slugify 4.0.1
pytz 2021.1
PyYAML 5.4.1
pyzmq 22.0.3
qtconsole 5.0.2
QtPy 1.9.0
readme-renderer 28.0
requests 2.25.1
requests-toolbelt 0.9.1
retrying 1.3.3
rich 9.10.0
ruamel.yaml 0.16.12
ruamel.yaml.clib 0.2.2
s3transfer 0.3.4
scikit-learn 0.23.2
scipy 1.6.0
seaborn 0.11.1
Send2Trash 1.5.0
setuptools 41.2.0
shortuuid 1.0.1
shtab 1.3.4
simplejson 3.17.2
six 1.15.0
smart-open 3.0.0
smmap 3.0.5
spacy 3.0.3
spacy-legacy 3.0.1
SQLAlchemy 1.3.23
srsly 2.4.0
tabulate 0.8.7
terminado 0.9.2
testpath 0.4.4
text-unidecode 1.3
textacy 0.10.0
thinc 8.0.1
threadpoolctl 2.1.0
tkidk 0.1.0 /Users/jrs/tkidk/src
toml 0.10.2
toolz 0.11.1
tornado 6.1
tox 3.15.0
tqdm 4.56.2
traitlets 5.0.5
twine 1.14.0
typer 0.3.2
typing-extensions 3.7.4.3
urllib3 1.26.3
virtualenv 20.4.2
voluptuous 0.12.1
wasabi 0.8.2
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 1.0.1
whisk 0.1.32
widgetsnbextension 3.5.1
xmltodict 0.12.0
zc.lockfile 2.0
zipp 3.4.0

honnibal commented 3 years ago

I can see how you might find the behaviour here unintuitive, but it's not broken.

What's going wrong for you is that nlp.to_bytes() and nlp.from_bytes() save and load the weights and state, but they do not control the configuration of the nlp pipeline. When you call spacy.load() on a directory, it does three steps: read the config, construct the nlp object and its pipeline from that config, and then load the binary data back in.

The nlp.from_bytes() and nlp.from_disk() methods only perform that last step. Consider that the config could refer to a subclass of Language, so there's no way calling the instance method could work if you were using an nlp instance of the wrong type.

In your specific example, the difference is that the models in the en_core_web_sm pipeline have different configs from the defaults that get used when you call nlp.add_pipe(pipe_name). The model is created with different sizes for some dimensions, so when you try to load the parameters back in, it fails. You could fix this by also passing in the configs for the components from the pipeline you serialized. The simpler solution, however, is to use the Language.from_config classmethod to set up the nlp object and its pipeline.
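Putting that together, a round-trip sketch: serialize the full nlp.config alongside the bytes, resolve the matching Language subclass with spacy.util.get_lang_class, rebuild the pipeline from the config, and only then load the weights. The sketch below uses a blank pipeline with a sentencizer so it runs without a downloaded model; the same pattern applies to a trained pipeline like en_core_web_sm.

```python
import spacy
from spacy.util import get_lang_class

def serialize(nlp):
    # Keep the full config, not just the pipe names: it records each
    # component's settings, including model dimensions like nV.
    return nlp.to_bytes(), nlp.config

def deserialize(bytes_data, config):
    # Resolve the matching Language subclass (e.g. English for "en"),
    # rebuild the pipeline from the config, then load the weights.
    lang_cls = get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls.from_config(config)
    nlp.from_bytes(bytes_data)
    return nlp

# A blank pipeline keeps the sketch self-contained; swap in
# spacy.load("en_core_web_sm") for the original example.
nlp0 = spacy.blank("en")
nlp0.add_pipe("sentencizer")
bytes_data, config = serialize(nlp0)
nlp1 = deserialize(bytes_data, config)
assert nlp1.pipe_names == nlp0.pipe_names
```

Because the components are constructed from the serialized config rather than from add_pipe defaults, their model dimensions match and from_bytes no longer raises the dimension error.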

stauntonjr commented 3 years ago

Thanks for your response Honnibal,

I tried to follow your instructions, but I still hit an issue:

import spacy
from spacy.language import Language

nlp = spacy.load('en_core_web_sm')

nlp1 = Language.from_config(nlp.config, vocab=nlp.vocab, disable=[], exclude=[], 
                             meta=nlp.meta, auto_fill=True, validate=True)

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 nlp0 = Language.from_config(nlp.config, vocab=nlp.vocab, disable=[], exclude=[],
      2                             meta=nlp.meta, auto_fill=True, validate=True)

~/tkidk/venv/lib/python3.8/site-packages/spacy/language.py in from_config(cls, config, vocab, disable, exclude, meta, auto_fill, validate)
   1579         config_lang = config["nlp"].get("lang")
   1580         if config_lang is not None and config_lang != cls.lang:
-> 1581             raise ValueError(
   1582                 Errors.E958.format(
   1583                     bad_lang_code=config["nlp"]["lang"],

ValueError: [E958] Language code defined in config (en) does not match language code of current Language subclass Language (None). If you want to create an nlp object from a config, make sure to use the matching subclass with the language-specific settings and data.
honnibal commented 3 years ago

@stauntonjr Sorry, I should've been clearer: it can be either Language or a subclass of it. You could amend your code to use nlp.__class__, or you could use the spacy.lang.en.English class explicitly. There's also a spacy.util.get_lang_class helper that maps the code "en" to the correct class.
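A quick sketch of the three options mentioned above, shown with a blank pipeline so it runs without a downloaded model:

```python
import spacy
from spacy.util import get_lang_class
from spacy.lang.en import English

nlp = spacy.blank("en")

# All three routes resolve to the same Language subclass:
assert nlp.__class__ is English          # the class of the loaded object
assert get_lang_class("en") is English   # lookup from the language code

# Calling from_config on the subclass succeeds where the base
# Language.from_config raises E958.
nlp1 = nlp.__class__.from_config(nlp.config)
assert isinstance(nlp1, English)
```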

stauntonjr commented 3 years ago

Sorry, that's still not clear.

Are you saying the config from nlp.config is not equivalent to the config that goes in as an argument to Language.from_config()?

How would I make this work?

honnibal commented 3 years ago

It's not that the config is different; it's that Language.from_config is a classmethod, and it creates an instance of the class it's called on. So if you call Language.from_config, you get an instance of the class Language. If you call English.from_config, you get an instance of the class English, and if you call Chinese.from_config, you get an instance of the class Chinese.

When you call spacy.load(), it goes and checks the config to figure out which class to fetch. Specifically it looks at the language code, and then passes that to the get_lang_class util.

What's gone wrong in your example is that the nlp = spacy.load("en_core_web_sm") has created an instance of English, and the config is specific to that subclass. You can't get back the same result from calling Language.from_config.
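The mismatch can be demonstrated directly: the base Language class has no language code, so it rejects a config written for English, while the English subclass accepts it.

```python
import spacy
from spacy.language import Language
from spacy.lang.en import English

config = spacy.blank("en").config  # config["nlp"]["lang"] == "en"

# The base class has lang = None, so the "en" config raises E958...
try:
    Language.from_config(config)
    raised = False
except ValueError:
    raised = True
assert raised

# ...while the matching subclass accepts it.
nlp = English.from_config(config)
assert isinstance(nlp, English)
```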

These source links might clarify things further:

stauntonjr commented 3 years ago

I pip-installed the spacy-lookups-data and then ran:

import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

nlp1 = spacy.lang.en.English.from_config(nlp.config, vocab=nlp.vocab, disable=[], exclude=[], 
                             meta=nlp.meta, auto_fill=True, validate=True)

nlp1.initialize()
nlp1("this is a sentence.")

and get:


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
      7                              meta=nlp.meta, auto_fill=True, validate=True)
      8
----> 9 nlp1.initialize()
     10 nlp1("this is a sentence.")

~/tkidk/venv/lib/python3.8/site-packages/spacy/language.py in initialize(self, get_examples, sgd)
   1244                 proc.initialize, p_settings, section="components", name=name
   1245             )
-> 1246             proc.initialize(get_examples, nlp=self, **p_settings)
   1247         self._link_components()
   1248         self._optimizer = sgd

~/tkidk/venv/lib/python3.8/site-packages/spacy/pipeline/tagger.pyx in spacy.pipeline.tagger.Tagger.initialize()

~/tkidk/venv/lib/python3.8/site-packages/spacy/pipeline/pipe.pyx in spacy.pipeline.pipe.Pipe._require_labels()

ValueError: [E143] Labels for component 'tagger' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.
honnibal commented 3 years ago

Yes, that's the expected result. After nlp1.initialize() you've set up blank components that don't have any labels or training data. The error describes the situation correctly.
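Following the hint in the E143 message, a minimal sketch of getting past that error: give a blank trainable component at least one label before calling initialize(). (In practice you would pass real training examples instead; "NOUN" here is just an illustrative label.)

```python
import spacy

nlp = spacy.blank("en")
tagger = nlp.add_pipe("tagger")

# A freshly constructed tagger has an empty label set; add_label gives
# initialize() something to size the output layer against.
tagger.add_label("NOUN")
nlp.initialize()

# The pipeline now runs, though an untrained tagger's predictions
# are meaningless until it is actually trained.
doc = nlp("this is a sentence.")
assert len(doc) == 5
```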

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.