Encoding Type for Pretrained Models

Kyubyong / wordvectors

Pre-trained word vectors of 30+ languages

MIT License


Encoding Type for Pretrained Models #8

Closed: jlmeo closed this issue 7 years ago

jlmeo commented 7 years ago

What is the encoding type for the pre-trained word2vec models? When trying to load a pre-trained model file I get the following error, and I have not been successful in troubleshooting this.

(using Portuguese as an example here)

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'pt/pt.bin',
    binary=True,
)

Error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte

jlmeo commented 7 years ago

The correct way to load one of these pretrained models is with load rather than load_word2vec_format:

model = gensim.models.KeyedVectors.load('pt/pt.bin')
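
The byte 0x93 at position 0 suggests that the file does not begin with the plain-text header line (vocabulary size and vector size) that the word2vec format starts with. As a minimal sketch for checking this before choosing a loader, the helper below reads the first line and tests for that header. The name looks_like_word2vec_binary is hypothetical and not part of gensim, and the check is only a heuristic.

```python
def looks_like_word2vec_binary(path):
    """Heuristic: True if the file starts with a '<vocab_size> <vector_size>' header."""
    with open(path, "rb") as f:
        # The header is a short text line, so cap the read at 64 bytes.
        first_line = f.readline(64)
    try:
        fields = first_line.decode("utf-8").split()
    except UnicodeDecodeError:
        # Bytes such as 0x93 are not valid UTF-8 start bytes, so this is not a header.
        return False
    return len(fields) == 2 and all(field.isdigit() for field in fields)
```

If this returns False for 'pt/pt.bin', that would be consistent with the error above and with load being the right call.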

gitlost-murali commented 7 years ago

And how do I get the vector of a particular word from the model, @jlmeo?
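
One possible approach, as a minimal sketch: gensim Word2Vec models expose their vectors under the wv attribute, and KeyedVectors objects can be indexed by word directly. The helper get_vector below is hypothetical (not a gensim function) and assumes the loaded object follows one of those two conventions.

```python
def get_vector(model, word):
    """Return the vector for `word`, or None if the word is not in the vocabulary."""
    # Word2Vec models keep vectors under .wv; KeyedVectors are indexed directly.
    keyed = getattr(model, "wv", model)
    if word not in keyed:
        return None
    return keyed[word]
```

With a model loaded as shown earlier in the thread, get_vector(model, 'casa') would return the vector for 'casa' if that word is in the vocabulary; 'casa' is only an illustrative word here.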