explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Problem deserializing Tokenizer on Windows (spaCy 2.0.3) #1634

Closed AurelienMassiot closed 5 years ago

AurelienMassiot commented 6 years ago

Hi, when I train a model with spaCy 2.0.3 in my environment 1, everything works well: I can save it, load it, and use it. However, when I try loading it in environment 2, I get the following error:

>>> spacy.load('my_model')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\spacy\__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 158, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Anaconda3\lib\site-packages\spacy\language.py", line 626, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 521, in from_disk
    reader(path / key)
  File "C:\Anaconda3\lib\site-packages\spacy\language.py", line 614, in <lambda>
    ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
  File "tokenizer.pyx", line 364, in spacy.tokenizer.Tokenizer.from_disk
  File "tokenizer.pyx", line 399, in spacy.tokenizer.Tokenizer.from_bytes
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 500, in from_bytes
    msg = msgpack.loads(bytes_data, encoding='utf8')
  File "C:\Anaconda3\lib\site-packages\msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "msgpack/_unpacker.pyx", line 139, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:2068)
TypeError: unhashable type: 'list'

Environment 1 : it works

* spaCy version      2.0.3
* Platform           Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
* Python version     3.6.3
* Models             en

Environment 2 : it doesn't work

* spaCy version      2.0.3
* Platform           Windows-2012Server-6.2.9200-SP0
* Python version     3.6.1
* Models             en

The 'en' models are installed on both, and the spaCy versions are the same. Could it be because of Windows? Or do you have any idea why I get this error?

Thanks a lot!

ines commented 6 years ago

Thanks for the report! It looks like something is going wrong when deserializing the tokenizer:

File "tokenizer.pyx", line 399, in spacy.tokenizer.Tokenizer.from_bytes

In any case, it looks like there might be a problem with the serialization of the tokenizer on Windows. Will look into this! To help us debug: Are you using any custom tokenization rules?

AurelienMassiot commented 6 years ago

Thanks for your quick answer! I don't think I'm using any custom tokenization rules; the only things I do to train and save the model are:

import random
from pathlib import Path


def train_ner(nlp, train_data, output_dir, nb_iterations=50, dropout=0.5):
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in train_data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(nb_iterations):
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=dropout,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)

    # Save model
    if not Path(output_dir).exists():
        Path(output_dir).mkdir()
    nlp.to_disk(Path(output_dir))
    print("model saved to: {}".format(output_dir))

ines commented 6 years ago

Thanks – definitely looks like a serialization bug then.

The tests for this are currently incomplete, because the output of msgpack for the tokenizer turned out to be inconsistent, which made it hard to test it the way we test the other components (e.g. by asserting that the msgpack output before and after serialization is equal). But we should definitely adjust the tests to at least make sure the serialization roundtrip works, so we can test the Windows behaviour properly on AppVeyor.
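
For example, a roundtrip test along these lines (just a sketch against the spaCy 2.x API) would catch the failure without asserting byte-for-byte equality:

from spacy.lang.en import English

def test_tokenizer_serialization_roundtrip():
    nlp = English()
    tokens_before = [t.text for t in nlp("Hello, world!")]
    # Serialize the tokenizer, then load the bytes back into it
    data = nlp.tokenizer.to_bytes()
    nlp.tokenizer.from_bytes(data)
    # The roundtrip should not raise, and tokenization should be unchanged
    assert [t.text for t in nlp("Hello, world!")] == tokens_before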

eranhirs commented 6 years ago

I built a model a week ago and successfully loaded it on my Windows 10 machine with spaCy 2.0.7.

I'm not sure what changed; I haven't run any pip installs in quite a while, but suddenly I get the same error as above when using spacy.load.

alexvy86 commented 6 years ago

Just to add another data point: we're seeing the same issue with spaCy 2.0.11. A custom model trained on one machine causes a TypeError: unhashable type: 'list' when loaded on another. Re-training the model on the second machine makes everything work, so it sounds like something machine-specific (?) might be getting used during serialization/deserialization? It reminded me of cookie encryption/decryption issues when a web server farm isn't configured to use the same encryption/decryption key.

alexvy86 commented 6 years ago

Strangely enough, a third computer was able to use the same model... I'm trying to figure out how machines 1 and 3 match while machine 2 is different; I'll update the thread if I come up with something.

ghost commented 6 years ago

Has anyone found a solution, other than adding another data point / re-training the model on the target computer?

alexvy86 commented 6 years ago

Not me, but coming back to this thread I just thought of something: in my case I'm keeping the models in source control (git), so maybe the automatic handling of LF/CRLF line endings is messing up the files? The machines where the models failed aren't mine, so I can't check what their settings look like, but I'll ask the people who own them to check and to try different settings (basically, check out as-is, commit as-is).

alexvy86 commented 6 years ago

Yep, in my case that was the problem! I fixed it by adding a .gitattributes file to the root of my repo, with something like this:

path/to/a/folder/with/a/spacy/model/** -text

That "unsets" the text attribute, telling git that it should not do CRLF conversion on any files under that path. Once that file is commited to the repo, the easiest solution is to clone the repository again. I also managed to fix the files by running rm .git/index followed by git reset --hard origin/<my-branch> (having the local version of <my-branch> checked out).

One last thing to consider is that the files might already have been changed by git at commit time, in which case the model might need to be retrained and committed again after adding the .gitattributes file, so that it doesn't get modified.
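
If you suspect the same problem, one way to confirm it is to compare checksums of the model files on the machine that trained the model and the machine where loading fails; any file git has rewritten will hash differently. A hypothetical check, with "my_model" as a placeholder path:

import hashlib
from pathlib import Path

# Print a checksum for every file in the model directory; run this on both
# machines and diff the output.
for path in sorted(Path("my_model").rglob("*")):
    if path.is_file():
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        print(digest, path.relative_to("my_model"))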

sachin-s-h commented 5 years ago

Hey, I faced the same issue too, and this is what fixed it for me. Follow the steps below to resolve the issue on Windows:

  1. If you have already cloned your repository, delete that clone.
  2. Run: git config --global core.autocrlf false
  3. Clone your repository again and re-run the code.

honnibal commented 5 years ago

tl;dr: run pip install "msgpack<0.6.0" and everything should be fixed. Alternatively, update spaCy with pip install "spacy>=2.0.18".
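
To check which msgpack you actually ended up with (msgpack-python exposes its version as a tuple):

import msgpack
print(msgpack.version)  # e.g. (0, 5, 6); (0, 6, 0) or later breaks older spaCy versions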

The issue here is that the msgpack library changed its behaviour around the use_list flag, and spaCy previously wasn't pinned to a precise enough version of the library to protect against breaking changes. This means that if you install older versions of spaCy, they stop working, because you get a newly released version of msgpack that breaks our code.
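
To make the failure concrete, here is a minimal sketch of the mechanism, assuming msgpack 0.5.x (the range spaCy 2.0.x was written against; msgpack >= 0.6 additionally rejects non-string map keys by default). A dict keyed by a tuple, which msgpack encodes as an array, only survives the roundtrip if arrays decode back to tuples:

import msgpack

packed = msgpack.packb({("a", "b"): 1})  # the tuple key is serialized as an array

msgpack.unpackb(packed, use_list=False)  # arrays decode to tuples: hashable, works
msgpack.unpackb(packed, use_list=True)   # arrays decode to lists: TypeError: unhashable type: 'list'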

To stop this happening we're now switching our dependencies to our own fork of msgpack and other serialisation utilities, which we're shipping in a library called srsly. We have this ready to release on spacy-nightly.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.