akutuzov / webvectors

Web-ify your word2vec: framework to serve distributional semantic models online
http://vectors.nlpl.eu/explore/embeddings/
GNU General Public License v3.0
197 stars, 48 forks

How to load model from zip file not having .bin file inside? #60

Closed 777umbra closed 4 months ago

777umbra commented 4 months ago

This code works fine:

    import zipfile
    import gensim

    nlpl_zip = "C:/180.zip"
    with zipfile.ZipFile(nlpl_zip, "r") as archive:
        stream = archive.open("model.bin")
        model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True, unicode_errors='replace')

But when I download the model from http://vectors.nlpl.eu/repository/20/212.zip to C:/212.zip, the same approach does not work, because there is no model.bin inside, only these files: [image: archive contents]. When I try stream = archive.open("model.ckpt.data-00000-of-00001") instead, I get the following error. What am I doing wrong?

    UnicodeDecodeError                        Traceback (most recent call last)
    Cell In[11], line 9
          7 with zipfile.ZipFile(model_file, 'r') as archive:
          8     stream = archive.open('model.ckpt.data-00000-of-00001')
    ----> 9     model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True,unicode_errors='replace')

    File C:\ProgramData\anaconda3\lib\site-packages\gensim\models\keyedvectors.py:1719, in KeyedVectors.load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header)
       1672 @classmethod
       1673 def load_word2vec_format(
       1674         cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
       1675         limit=None, datatype=REAL, no_header=False,
       1676     ):
       1677     """Load KeyedVectors from a file produced by the original C word2vec-tool format.
       1678
       1679     Warnings
       (...)
       1717
       1718     """
    -> 1719     return _load_word2vec_format(
       1720         cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
       1721         limit=limit, datatype=datatype, no_header=no_header,
       1722     )

    File C:\ProgramData\anaconda3\lib\site-packages\gensim\models\keyedvectors.py:2058, in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header, binary_chunk_size)
       2056     fin = utils.open(fname, 'rb')
       2057 else:
    -> 2058     header = utils.to_unicode(fin.readline(), encoding=encoding)
       2059     vocab_size, vector_size = [int(x) for x in header.split()]  # throws for invalid file format
       2060     if limit:

    File C:\ProgramData\anaconda3\lib\site-packages\gensim\utils.py:365, in any2unicode(text, encoding, errors)
        363 if isinstance(text, str):
        364     return text
    --> 365 return str(text, encoding, errors=errors)

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 1: invalid continuation byte
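The traceback shows gensim failing while it decodes the first line of the file as a word2vec header (a `vocab_size vector_size` pair). A quick way to check an archive before handing it to gensim is to list its members and test whether a member starts with such a header. The following is a minimal sketch using only the standard library; the helper names `list_members`, `has_word2vec_header` and `check_members` are hypothetical and not part of gensim or webvectors.

```python
import io
import zipfile


def list_members(zip_path_or_file):
    """Return the names of all files stored in a zip archive."""
    with zipfile.ZipFile(zip_path_or_file, "r") as archive:
        return archive.namelist()


def has_word2vec_header(stream):
    """Check whether a binary stream starts with a 'vocab_size vector_size' line."""
    # Cap the read: arbitrary binary data may not contain a newline for a long time.
    first_line = stream.readline(1024)
    try:
        fields = first_line.decode("utf-8").split()
        return len(fields) == 2 and all(int(field) > 0 for field in fields)
    except ValueError:
        # Covers non-numeric fields and UnicodeDecodeError (a ValueError subclass).
        return False


def check_members(zip_path_or_file):
    """Map each archive member to whether it starts with a word2vec-style header."""
    report = {}
    with zipfile.ZipFile(zip_path_or_file, "r") as archive:
        for name in archive.namelist():
            with archive.open(name) as stream:
                report[name] = has_word2vec_header(stream)
    return report
```

Calling `check_members("C:/212.zip")` would return a dictionary of member names and booleans. If no member passes the check, the archive most likely does not contain a word2vec model, and `load_word2vec_format` should not be expected to read it.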

akutuzov commented 4 months ago

Hi, this is not a word2vec model; it is an ELMo contextualized language model. Gensim cannot handle such models. You should use a library such as https://github.com/ltgoslo/simple_elmo instead.
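As a rough illustration of this advice, the sketch below picks a loader based on the file names inside the archive. It rests on several assumptions: the name-based heuristic in `guess_model_kind` is derived only from the two archives mentioned in this thread, both helper functions are hypothetical, and the `ElmoModel` class with its `load()` method is written as recalled from the simple_elmo README, so it should be verified against that project's documentation before use.

```python
import zipfile


def guess_model_kind(member_names):
    """Guess the model type from archive member names (heuristic, not authoritative)."""
    if "model.bin" in member_names:
        return "word2vec"
    if any(name.startswith("model.ckpt") for name in member_names):
        return "elmo"
    return "unknown"


def load_model(zip_path):
    """Load a model from a zip archive with the library that matches its kind."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        kind = guess_model_kind(archive.namelist())

    if kind == "word2vec":
        import gensim  # imported lazily so the ELMo path does not require gensim

        with zipfile.ZipFile(zip_path, "r") as archive:
            with archive.open("model.bin") as stream:
                return gensim.models.KeyedVectors.load_word2vec_format(
                    stream, binary=True, unicode_errors="replace"
                )
    if kind == "elmo":
        # Assumption: simple_elmo provides ElmoModel, and load() accepts a zip path.
        from simple_elmo import ElmoModel

        model = ElmoModel()
        model.load(zip_path)
        return model
    raise ValueError(f"Cannot tell which library should load {zip_path!r}")
```

With this sketch, `load_model("C:/180.zip")` would go through gensim, while `load_model("C:/212.zip")` would go through simple_elmo. Note that the two return values are different kinds of objects: an ELMo model produces contextualized vectors for whole sentences and is not queried like a `KeyedVectors` lookup table.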

777umbra commented 4 months ago

Thanks a lot!

akutuzov commented 4 months ago

Hope this helps!