DOsinga / deep_learning_cookbook

Deep Learning Cookbook
Apache License 2.0

ValueError: string size must be a multiple of element size #37

azer closed this issue 5 years ago

azer commented 5 years ago

Hey, thanks a lot for Deep Learning Cookbook. I'm enjoying it!

Following the text similarity example notebook, I created the below Python module:

import os
from keras.utils import get_file
import gensim
import subprocess
import numpy as np
#import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
figsize(10, 10)

from sklearn.manifold import TSNE
import json
from collections import Counter
from itertools import chain

MODEL = 'GoogleNews-vectors-negative300.bin'
path = get_file(MODEL + '.gz',
                'https://deeplearning4jblob.blob.core.windows.net/resources/wordvectors/%s.gz' % MODEL)

if not os.path.isdir('generated'): os.mkdir('generated')

unzipped = os.path.join('generated', MODEL)
if not os.path.isfile(unzipped):
    with open(unzipped, 'wb') as fout:
        zcat = subprocess.Popen(['zcat'], stdin=open(path), stdout=fout )
        zcat.wait()

model = gensim.models.KeyedVectors.load_word2vec_format(unzipped, binary=True)

print(model.most_similar(positive=['espresso']))

And I get the following error:

(venv3) λ  python3.6 most_similar.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "most_similar.py", line 27, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(unzipped, binary=True)
  File "/home/azer/Projects/deep-learning-exercises/venv3/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1438, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "/home/azer/Projects/deep-learning-exercises/venv3/lib/python3.6/site-packages/gensim/models/utils_any2vec.py", line 212, in _load_word2vec_format
    weights = fromstring(fin.read(binary_len), dtype=REAL).astype(datatype)
ValueError: string size must be a multiple of element size

I searched Google for similar errors but couldn't find anything helpful. Any ideas?

P.S. The same code also fails in Jupyter Notebook.

DOsinga commented 5 years ago

I have not seen that error before. All I can think of is that the downloaded file is somehow corrupt, so loading it as a word2vec model fails. You could try deleting it and downloading it again.
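One way to test the corruption theory before re-downloading is sketched below. It assumes the cached archive is an ordinary gzip file at the `path` returned by `get_file`. The helper names `gzip_is_intact` and `remove_if_corrupt` are hypothetical and not part of the notebook. Reading the archive to the end makes Python's `gzip` module verify the trailing checksum and length, so a truncated or damaged download should show up as an exception.

```python
import gzip
import os
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip archive at `path` decompresses without error.

    Hypothetical helper, not part of the notebook.
    """
    try:
        with gzip.open(path, 'rb') as fin:
            # Reading to the end forces the CRC32 and length checks.
            while fin.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError: not a gzip file or bad checksum; EOFError: truncated file;
        # zlib.error: damaged compressed data.
        return False


def remove_if_corrupt(path):
    """Delete a corrupt archive; return True if the file was removed."""
    if os.path.isfile(path) and not gzip_is_intact(path):
        os.remove(path)
        return True
    return False
```

If `remove_if_corrupt(path)` returns True, calling `get_file` again should fetch a fresh copy, since the cached file is gone. The partially unzipped file under `generated` would also need to be deleted so that the unzip step runs again.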

azer commented 5 years ago

Right, the file was corrupt. Closing the issue!
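For context on why a corrupt file surfaces as this particular error: the traceback shows gensim reading `binary_len` bytes per word and parsing them as floats. If the file ends early, the final read returns fewer bytes than one full vector, and the parse fails. The sketch below reproduces that failure mode in isolation, assuming 300-dimensional float32 vectors as the model name suggests. It uses `np.frombuffer` rather than the `fromstring` call from the traceback, since binary-mode `fromstring` is deprecated in newer NumPy releases; the wording of the error message may differ slightly between the two.

```python
import numpy as np

vector_size = 300  # assumed from the model name "negative300"
binary_len = np.dtype(np.float32).itemsize * vector_size  # 1200 bytes per vector

complete = bytes(binary_len)   # a full vector's worth of zero bytes
truncated = complete[:-3]      # 1197 bytes, as if the file ended mid-vector

# A complete read parses cleanly into 300 floats.
vec = np.frombuffer(complete, dtype=np.float32)

# A short read is not a multiple of 4 bytes, so parsing raises ValueError.
try:
    np.frombuffer(truncated, dtype=np.float32)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(vec.shape)
print(error_message)
```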