epfml / sent2vec

General purpose unsupervised sentence representations

Can't load .bin file in gensim, is there a way to generate .vec instead? #20

horiacristescu closed this issue 6 years ago

horiacristescu commented 6 years ago

I am trying to load the ".bin" model file in gensim (v3.3.0) from sent2vec, but I get this error:

/usr/local/lib/python2.7/dist-packages/gensim/models/utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    167     with utils.smart_open(fname) as fin:
    168         print "TEST=", fin.readline()
--> 169         header = utils.to_unicode(fin.readline(), encoding=encoding)
    170         vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
    171         if limit:

/usr/local/lib/python2.7/dist-packages/gensim/utils.pyc in any2unicode(text, encoding, errors)
    327     if isinstance(text, unicode):
    328         return text
--> 329     return unicode(text, encoding, errors=errors)
    330 
    331 

/usr/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 31: invalid continuation byte

I looked for the plain text format (.vec) of the model but I can't find it, so I presume sent2vec doesn't generate it.

I also tried "./fasttext print-word-vectors model.bin" but it just hangs.

How can I use the vectors in gensim?

mpagli commented 6 years ago

The .bin models are not compatible with gensim. Support for generating a .vec file would not be hard to add, but you would get only unigrams. Without modifying the source code, you can generate a .vec file by first running a command such as:

./fasttext print-sentence-vectors model.bin < vocabulary.txt 

And then merging the output with the tokens in vocabulary.

Here again you cannot get bigram embeddings.
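
The merge step could be sketched roughly as follows. This is a minimal illustration and not part of sent2vec: it assumes vocabulary.txt holds one token per line, and that the command above prints one line of whitespace-separated floats per input line, in the same order. The function name merge_to_vec is hypothetical. The output follows the word2vec text layout that gensim's loader expects (a "vocab_size vector_size" header, then one "token v1 v2 ..." row per word).

```python
def merge_to_vec(tokens, vector_lines):
    """Combine tokens and vector lines into word2vec text format.

    Assumption: tokens[i] corresponds to vector_lines[i], and each vector
    line contains only whitespace-separated numbers (no leading token).
    """
    tokens = [t.strip() for t in tokens]
    vectors = [line.split() for line in vector_lines]
    if len(tokens) != len(vectors):
        raise ValueError("got %d tokens but %d vectors" % (len(tokens), len(vectors)))

    # The dimension is taken from the first vector; every row must match it.
    dim = len(vectors[0]) if vectors else 0
    rows = ["%d %d" % (len(tokens), dim)]  # header: vocab_size vector_size
    for token, values in zip(tokens, vectors):
        if len(values) != dim:
            raise ValueError("vector for %r has %d values, expected %d"
                             % (token, len(values), dim))
        rows.append(" ".join([token] + values))
    return "\n".join(rows) + "\n"
```

If the command's output was redirected to a file (say vectors.txt, a name chosen here for illustration), the lines of vocabulary.txt and vectors.txt can be passed to merge_to_vec and the result written to model.vec. That file should then be loadable with gensim's KeyedVectors.load_word2vec_format("model.vec", binary=False), still with unigrams only.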

espoirMur commented 3 years ago

Thanks for this. How can I get both bigram and unigram embeddings?