tugcegns closed this issue 2 years ago
Sorry you're having trouble with this; I don't think we've seen an error like this before.
To be clear, you're committing the directory with your pipeline to GitHub? That's not something that's usually done, though I would expect it to work. Some things to check:
Can you give us the full stack trace of the error? It would be very helpful to know where exactly it failed.
Is the model's file size still roughly the same after checking it out? If you have large files, it's possible they don't check out properly from GitHub and may require Git LFS or something similar.
Are you checking the model out on the same machine that made it? If not, is it possible there's a version mismatch of srsly or spaCy? Some version mismatches should be OK, but it's important to be aware of them, as they are a potential source of errors (see the snippet below for a quick way to check).
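To make those checks concrete, something like the following would print the relevant versions and the size of the vectors key table (a minimal sketch; model_dir is a placeholder for your pipeline directory):

from pathlib import Path

import spacy
import srsly

# Placeholder: point this at the pipeline directory checked out from GitHub.
model_dir = Path("path/to/your/model")

# Version mismatches between the training and loading environments are a
# possible source of deserialization errors, so print both versions.
print("spaCy:", spacy.__version__)
print("srsly:", srsly.__version__)

# If large files weren't checked out properly (e.g. Git LFS pointer files),
# the on-disk size will be much smaller than the original model's.
key2row = model_dir / "vocab" / "key2row"
if key2row.exists():
    print("vocab/key2row size:", key2row.stat().st_size, "bytes")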
I'm adding the whole traceback here:
Traceback (most recent call last):
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/tugcegunes/.vscode/extensions/ms-python.python-2022.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/Users/tugcegunes/.vscode/extensions/ms-python.python-2022.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/Users/tugcegunes/.vscode/extensions/ms-python.python-2022.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/Users/tugcegunes/.vscode/extensions/ms-python.python-2022.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
File "/Users/tugcegunes/.vscode/extensions/ms-python.python-2022.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/Users/tugcegunes/.vscode/extensions/ms-python.python-2022.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/Users/tugcegunes/Documents/GitHub/vesper-nlp/src/main.py", line 20, in <module>
RunNewData().main(news_id_to_run=news_id_to_run)
File "/Users/tugcegunes/Documents/GitHub/vesper-nlp/src/process_for_new_data.py", line 23, in __init__
self.extractor = Keyword_Extractor()
File "/Users/tugcegunes/Documents/GitHub/vesper-nlp/src/keyword_extractor.py", line 30, in __init__
self.nlp2 = spacy.load(model_file_path) # to be updated
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/spacy/__init__.py", line 54, in load
return util.load_model(
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/spacy/util.py", line 431, in load_model
return load_model_from_path(Path(name), **kwargs) # type: ignore[arg-type]
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/spacy/util.py", line 511, in load_model_from_path
return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/spacy/language.py", line 2115, in from_disk
util.from_disk(path, deserializers, exclude) # type: ignore[arg-type]
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/spacy/util.py", line 1340, in from_disk
reader(path / key)
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/spacy/language.py", line 2091, in deserialize_vocab
self.vocab.from_disk(path, exclude=exclude)
File "spacy/vocab.pyx", line 492, in spacy.vocab.Vocab.from_disk
File "spacy/vectors.pyx", line 620, in spacy.vectors.Vectors.from_disk
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/spacy/util.py", line 1340, in from_disk
reader(path / key)
File "spacy/vectors.pyx", line 591, in spacy.vectors.Vectors.from_disk.load_key2row
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/srsly/_msgpack_api.py", line 55, in read_msgpack
msg = msgpack.load(f, raw=False, use_list=use_list)
File "/Users/tugcegunes/opt/anaconda3/envs/tf_env/lib/python3.8/site-packages/srsly/msgpack/__init__.py", line 67, in unpack
return _unpack(stream, **kwargs)
File "srsly/msgpack/_unpacker.pyx", line 213, in srsly.msgpack._unpacker.unpack
File "srsly/msgpack/_unpacker.pyx", line 199, in srsly.msgpack._unpacker.unpackb
srsly.msgpack.exceptions.ExtraData: unpack(b) received extra data.
I'm not sure about the file size. The machine is the same, so I don't think it's something with versions.
Thanks for the traceback; that makes the error clearer. It looks like it's specifically happening when deserializing key2row, which is related to the vectors stored with the pipeline.
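One way to check that file in isolation would be something like this (a minimal sketch; the path is a placeholder for your checked-out pipeline):

import srsly

# Placeholder: path to the key2row file inside the checked-out pipeline.
key2row_path = "model-dir/vocab/key2row"

# This mirrors the call spaCy makes when loading the vectors; if the file was
# corrupted or truncated by the git round-trip, reading it on its own should
# raise the same ExtraData error.
key2row = srsly.read_msgpack(key2row_path)
print(type(key2row), len(key2row))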
I tried this locally, using the ner_drugs sample project, modified to use vectors from en_core_web_md. After running the project, I checked the training/model-best directory into a GitHub repo, then checked it out and loaded it. It worked without issue. So this doesn't seem to be a universal issue, and I'm not currently sure how to reproduce it.
Can you clarify exactly how you're checking the model into git? Can you check the output of md5sum model-dir/vocab/key2row on the model that works and the one checked out from GitHub, so we can tell whether the file is exactly the same? It's possible this is a bug, but it could also be that the file was truncated for some reason.
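If md5sum isn't convenient, a Python equivalent would look roughly like this (a minimal sketch; both paths are placeholders for the working model and the copy checked out from GitHub):

import hashlib
from pathlib import Path

def md5_of(path):
    # Hash the file in chunks so large vector files don't need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholders: the original model and the copy checked out from GitHub.
original = Path("model-dir/vocab/key2row")
checked_out = Path("checked-out-model-dir/vocab/key2row")

print("original:   ", md5_of(original))
print("checked out:", md5_of(checked_out))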
This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
After I trained the custom NER model, I save it using nlp.to_disk(model_path). Then I load it from the same path and it works fine. But the problem starts after I push the code, together with the created NER model, to the GitHub repository. I run the same loading call, spacy.load(model_path), and it gives this error:
Do you have any idea what causes this? I tried the same process multiple times to be sure it only happens after pushing the code, and apparently that is indeed the problem, but I don't know how to solve it.
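Roughly, the sequence is the following (a minimal sketch; model_path and the blank pipeline stand in for my actual training setup):

import spacy

# Placeholder for the actual trained custom NER pipeline.
nlp = spacy.blank("en")

model_path = "models/custom_ner"  # placeholder: the directory committed to the repo

# Save the pipeline to disk after training.
nlp.to_disk(model_path)

# Loading it straight back from the same path works fine.
nlp_reloaded = spacy.load(model_path)

# After pushing model_path to GitHub and checking it out again on the same
# machine, the same spacy.load(model_path) call raises the ExtraData error.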
Your Environment