Closed AurelienMassiot closed 5 years ago
Thanks for the report! It looks like something is going wrong when deserializing the tokenizer:
File "tokenizer.pyx", line 399, in spacy.tokenizer.Tokenizer.from_bytes
In any case, it looks like there might be a problem with the serialization of the tokenizer on Windows. Will look into this! To help us debug: Are you using any custom tokenization rules?
Thanks for your quick answer. I'm not using any custom tokenization rules as far as I know; the only things I do to train and save the model are:
Define the training data, for example:
train_data = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]
nlp = spacy.load("en")
Train the NER with a function closely based on spaCy's example:
import random
from pathlib import Path

def train_ner(nlp, train_data, output_dir, nb_iterations=50, dropout=0.5):
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in train_data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(nb_iterations):
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=dropout,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)

    # Save model
    if not Path(output_dir).exists():
        Path(output_dir).mkdir()
    nlp.to_disk(Path(output_dir))
    print("model saved to: {}".format(output_dir))
Thanks – definitely looks like a serialization bug then.
The tests for this are currently incomplete, because the output of msgpack for the tokenizer turned out to be inconsistent, which made it hard to test the way we're testing the other components (e.g. by asserting that the msgpack output before and after are equal). But we should definitely adjust the tests to at least make sure the serialization roundtrip works, so we can test the Windows behaviour properly on AppVeyor.
I built a model a week ago and successfully loaded it on my Windows 10 machine with spacy 2.0.7. Not sure what changed; I haven't run any pip installs in quite a while, but suddenly I get the same error when using spacy.load as before.
Just to add another data point: we're seeing the same issue with spacy 2.0.11. A custom model trained on one machine causes a TypeError: unhashable type: 'list' error when loading it on another. Re-training the model on the second machine makes everything work, so it sounds like some machine-specific "something" (?) might be getting used during serialization/deserialization. It reminded me of cookie encryption/decryption issues when a web server farm isn't configured to use the same encryption/decryption key.
Strangely enough, a third computer was able to use the same model... I'm trying to figure out how machines 1 and 3 match while machine 2 differs; I'll update the thread if I come up with something.
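For what it's worth, that TypeError is exactly what you get when data that is expected to come back as tuples is deserialized as lists: tuples can be dict keys, lists cannot. A minimal stdlib-only sketch (the key values are made up for illustration):

```python
# Tuples are hashable, so they work as dict keys.
cache = {("London", "LOC"): 1}

bad_key = ["London", "LOC"]       # same data, but deserialized as a list
print(cache[tuple(bad_key)])      # converting back to a tuple works: 1

try:
    cache[bad_key]                # a list key raises immediately
except TypeError as err:
    print(err)                    # unhashable type: 'list'
```

So any change that makes a deserializer return lists where the consuming code expects tuples produces exactly this error on load.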
Has anyone found a solution that doesn't involve re-training the model on the target computer?
Not me, but coming back to this thread I just thought of something... in my case I'm putting the models in source control (git), so maybe the auto-handling of LF/CRLF characters is messing up the files? The machines where the models failed for us aren't mine so I can't check what their settings look like, but I'll ask the people who own them to check and try with different settings (basically, check-out as-is, commit as-is).
Yep, in my case that was the problem! I fixed it by adding a .gitattributes file to the root of my repo, with something like this:
path/to/a/folder/with/a/spacy/model/** -text
That "unsets" the text attribute, telling git that it should not do CRLF conversion on any files under that path. Once that file is committed to the repo, the easiest solution is to clone the repository again. I also managed to fix the files by running rm .git/index followed by git reset --hard origin/<my-branch> (with the local version of <my-branch> checked out).
One last thing to consider: the files might already have been changed by git at commit time, in which case the model might need to be retrained and committed again after adding the .gitattributes file, so it doesn't get modified.
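To see why the conversion corrupts the model: a serialized model is binary data, and any byte pair that happens to look like a Windows line ending gets rewritten by git's text normalization. A small sketch of what a CRLF-to-LF rewrite does to such bytes (the byte string is made up):

```python
# A made-up chunk of "binary" model data that happens to contain CRLF bytes.
original = b"\x93\x0d\x0aMODEL\x0d\x0aDATA"

# What a CRLF -> LF text conversion does to those bytes:
converted = original.replace(b"\r\n", b"\n")

print(len(original), len(converted))  # 14 12 -- the file shrinks
print(original == converted)          # False: the bytes no longer round-trip
```

Any length or content change like this breaks deserialization of the model, which is why the fix is to mark the model directory as non-text.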
Hey, I faced the same issue too, and this is what fixed it for me. Follow the steps below to resolve the issue on Windows:
tl;dr: run pip install "msgpack<0.6.0" and you should have everything fixed. Alternatively, update spaCy with pip install "spacy>=2.0.18" (quote the requirement so the shell doesn't interpret the >=).
The issue here is that the msgpack library changed behaviour around the use_list flag, and spaCy previously wasn't pinned to a precise enough version of the library to prevent breaking changes. This means that if you install older versions of spaCy, they cease to work, because you get a newly released version of msgpack that breaks our code.
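If you're unsure whether your environment is affected, you can compare the installed version against that pin. A small stdlib-only sketch (the package name and the <0.6.0 threshold come from the advice above; the helper name is made up):

```python
from importlib.metadata import version, PackageNotFoundError

def needs_pin(pkg="msgpack", upper=(0, 6, 0)):
    """Return True if `pkg` is installed at or above `upper`, i.e. should be pinned down."""
    try:
        raw = version(pkg)
    except PackageNotFoundError:
        return False  # not installed, nothing to pin
    installed = []
    for part in raw.split(".")[:3]:
        digits = "".join(ch for ch in part if ch.isdigit())
        installed.append(int(digits) if digits else 0)
    return tuple(installed) >= upper
```

For example, needs_pin() returning True would mean your msgpack is new enough to hit the use_list breakage with an unpinned older spaCy.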
To stop this happening, we're now switching our dependencies over to our own fork of msgpack and other serialisation utilities, which we're shipping in a library called srsly. We have this ready to release on spacy-nightly.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi, when I train a model with spaCy 2.0.3 in environment 1, everything works well: I can save it, load it, and use it. However, when I try loading it in environment 2, I get the following error:
Environment 1 : it works
Environment 2 : it doesn't work
'EN' models are installed on both and the spaCy versions are the same; could it be because of Windows? Or do you have any idea why I get this error?
Thanks a lot !