Kyubyong / wordvectors

Pre-trained word vectors of 30+ languages
MIT License

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte #12

Open liwzhi opened 6 years ago

liwzhi commented 6 years ago

Hi,

I am trying to load the Chinese pretrained word2vec model:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)  # C binary format

It throws this error.

wiwengweng commented 6 years ago

Of course, the vectors have to be read with the same encoding they were trained with. It seems the model was trained in a different encoding environment. Can you check that?

lxw0109 commented 6 years ago

I have come across the same error. Can anybody help? Thank you ~

galuhsahid commented 6 years ago

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.
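
The byte in the error message is itself a hint: 0x80 at position 0 is how a pickle stream (protocol 2 or later) begins, which is consistent with a file written by gensim's own save rather than by the C tool. A minimal sketch for checking a file before choosing a loader, using only the standard library; `sniff_vector_file` is a hypothetical helper name, and the "word2vec" check only looks for a `<count> <dim>` header line:

```python
def sniff_vector_file(path):
    """Guess how a word-vector file was written by looking at its first bytes."""
    with open(path, "rb") as f:
        head = f.readline(64)  # first line, capped at 64 bytes

    if head.startswith(b"\x1f\x8b"):
        return "gzip"      # still compressed; decompress before loading
    if head.startswith(b"\x80"):
        return "pickle"    # pickle protocol 2+, e.g. a natively saved model

    # word2vec text and C binary files both start with an ASCII "<count> <dim>" line
    fields = head.split()
    if len(fields) == 2 and all(x.isdigit() for x in fields):
        return "word2vec"
    return "unknown"
```

If this returns "pickle", the `load` call above is the one to try; if it returns "word2vec", `load_word2vec_format` is the likelier match; "gzip" means the file needs decompressing first.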

lxw0109 commented 6 years ago

@galuhsahid Thank you so much, it works now. : )

anavaldi commented 6 years ago

I tried to load the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

Priya22 commented 6 years ago

Same error as @anavaldi . Any solution?

anavaldi commented 6 years ago

I solved this error by executing on my own word embeddings with the .sh file.

hinanmu commented 6 years ago

I have come across the same error. I changed gensim.models.KeyedVectors.load_word2vec_format() to gensim.models.Word2Vec.load(), and then it worked.

changhyub commented 6 years ago

@hinanmu It works, thanks.

gilgtc commented 6 years ago

@anavaldi

I solved this error by executing on my own word embeddings with the .sh file.

What do you mean?

caitaozhan commented 5 years ago

I tried to load the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

I solved this issue by downgrading my gensim version from 3.6 to 3.0.
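
Since the fix here depended on which gensim release was installed, it may help to check the version in the active environment before loading a model. A minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); `installed_version` and `major_minor` are hypothetical helper names:

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def major_minor(version):
    """Turn a version string such as '3.6.0' into (3, 6) for a rough comparison."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])

# Show what is installed in the current environment (None if gensim is missing).
print("gensim:", installed_version("gensim"))
```

If the reported version is newer than the one the model was saved with, pinning an older release with pip is one thing to try.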

kusumlata123 commented 5 years ago

UnpicklingError                           Traceback (most recent call last)

in ()
      3 #model=gensim.models.Word2Vec.load_word2vec_format('model_file', binary=True) Word2Vec.load_word2vec_format
      4 #model_bin = KeyedVectors.load_word2vec_format(model_file,binary=True)
----> 5 model=gensim.models.Word2Vec.load(model_file)
      6 #model=gensim.Word2Vec.load_word2vec_format('model_file',binary=True)

word_vectors = KeyedVectors.load(path)

Why is it giving this error?

Koteswara-ML commented 5 years ago

@kusumlata123 I am getting that UnpicklingError too.

bright1993ff66 commented 5 years ago

I am also getting the unpickling error... Any ideas? My code is:

chinese_model = gensim.models.Word2Vec.load(os.path.join(desktop, 'cc.zh.300.bin.gz'))

bright1993ff66 commented 5 years ago

I also tried to save it as a text file and load it with the function provided on the official fastText site. I first changed the file extension from gz to txt and then used the following function:

import io
import os

# Reads a text-format vector file: a "<count> <dim>" header, then one word per line.
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = list(map(float, tokens[1:]))  # list() so the values can be reused
    return data

model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

However, I got the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-d67f52bde947> in <module>
----> 1 model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

<ipython-input-3-0f69b5ce62b8> in load_vectors(fname)
      1 def load_vectors(fname):
      2     fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
----> 3     n, d = map(int, fin.readline().split())
      4     data = {}
      5     for line in fin:

ValueError: invalid literal for int() with base 10: '\x08\x08p[\x00\x03cc.zh.300.vec\x00\\ͮfMr7?W3ۀ0|Szдl\x14I\x132'
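
The bytes in that ValueError look like a gzip header (it even embeds the original file name, cc.zh.300.vec), which suggests the file is still compressed: changing the extension from gz to txt does not decompress it. A minimal sketch of the same loader reading through `gzip.open` instead, assuming the file is the gzip-compressed text format; `load_vectors_gz` is a hypothetical name:

```python
import gzip

def load_vectors_gz(fname):
    """Read text-format word vectors directly from a gzip-compressed file."""
    # "rt" makes gzip decompress and decode to text on the fly.
    with gzip.open(fname, "rt", encoding="utf-8", newline="\n", errors="ignore") as fin:
        n, d = map(int, fin.readline().split())  # header: "<count> <dim>"
        data = {}
        for line in fin:
            tokens = line.rstrip().split(" ")
            data[tokens[0]] = list(map(float, tokens[1:]))
    return n, d, data
```

This would be called on the original .gz file, not on the renamed .txt copy.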

thejastr commented 4 years ago

I tried the above solution, but I am getting this error: UnpicklingError: invalid load key, '\x1f'

My code:

from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)
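
The '\x1f' in this error is the first byte of the gzip signature, so the unpickler is most likely being handed a compressed file. A minimal sketch of decompressing it first with the standard library; `gunzip` is a hypothetical helper and the output name is up to the caller:

```python
import gzip
import shutil

def gunzip(src, dst):
    """Decompress the gzip file `src` into the plain file `dst`, streaming in chunks."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

The GoogleNews vectors are in the C binary format, so after decompression the call from the top of this thread, KeyedVectors.load_word2vec_format(path, binary=True), is probably the one to try rather than KeyedVectors.load.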

ashutoshsoni891 commented 3 years ago

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

trungluu91 commented 2 years ago

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

For the Korean model, I got this error:

AttributeError: Can't get attribute 'Vocab' on <module 'gensim.models.word2vec' from 'C:\Users\ductr\Python\lib\site-packages\gensim\models\word2vec.py'>

Would you mind letting me know what this error means?

Louislazarus commented 1 year ago

I tried the above solution, but I am getting this error: UnpicklingError: invalid load key, '\x1f'

My code:

from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

I get the same error after using:

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
model = Word2Vec.load(model_path)

What am I doing wrong?