akutuzov / webvectors

Web-ify your word2vec: framework to serve distributional semantic models online
http://vectors.nlpl.eu/explore/embeddings/
GNU General Public License v3.0

Can't download geowac_lemmas_none_fasttextskipgram_300_5_2020 following the tutorial #59

Closed: karasevdy closed this issue 5 months ago

karasevdy commented 5 months ago

I unzipped the archive and realized that geowac_lemmas_none_fasttextskipgram_300_5_2020 isn't a binary model: it contains only 'model.model' and no 'model.bin', unlike ruscorpora_upos_skipgram_600_10_2017, for instance. So I tried this:

import zipfile
model_url = 'http://vectors.nlpl.eu/repository/20/213.zip'
m = wget.download(model_url)
model_file = model_url.split('/')[-1]
with zipfile.ZipFile(model_file, 'r') as archive:
    stream = archive.open('model.model')
    model = FastText.load_fasttext_format(datapath(stream))


TypeError                                 Traceback (most recent call last)
in ()
      7 with zipfile.ZipFile(model_file, 'r') as archive:
      8     stream = archive.open('model.model')
----> 9     model = FastText.load_fasttext_format(datapath(stream))

2 frames
/usr/lib/python3.10/genericpath.py in _check_arg_types(funcname, *args)
    150             hasbytes = True
    151         else:
--> 152             raise TypeError(f'{funcname}() argument must be str, bytes, or '
    153                             f'os.PathLike object, not {s.__class__.__name__!r}') from None
    154     if hasstr and hasbytes:

TypeError: join() argument must be str, bytes, or os.PathLike object, not 'ZipExtFile'

Then I unzipped the archive, loaded 'model.model' and 'model.model.vectors.npy' into Google Colab, and tried to open each of them directly via KeyedVectors.load(), FastText.load_fasttext_format(), gensim.models.fasttext.load_facebook_model(), or gensim.models.fasttext.load_facebook_vectors().

model = KeyedVectors.load('model.model')

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
in ()
----> 1 model = KeyedVectors.load('model.model')

4 frames
/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
    425             own_fid = False
    426         else:
--> 427             fid = stack.enter_context(open(os_fspath(file), "rb"))
    428             own_fid = True
    429

FileNotFoundError: [Errno 2] No such file or directory: 'model.model.vectors_vocab.npy'

model = KeyedVectors.load('model.model.vectors.npy')

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
in ()
----> 1 model = KeyedVectors.load('model.model.vectors.npy')

1 frames
/usr/local/lib/python3.10/dist-packages/gensim/utils.py in unpickle(fname)
   1459     """
   1460     with open(fname, 'rb') as f:
-> 1461         return _pickle.load(f, encoding='latin1')  # needed because loading from S3 doesn't support readline()
   1462
   1463

UnpicklingError: STACK_GLOBAL requires str

model = FastText.load_fasttext_format('model.model')

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
in ()
----> 1 model = FastText.load_fasttext_format('model.model')

5 frames
/usr/local/lib/python3.10/dist-packages/gensim/models/_fasttext_bin.py in _load_vocab(fin, new_format, encoding)
    196     # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
    197     if nlabels > 0:
--> 198         raise NotImplementedError("Supervised fastText models are not supported")
    199     logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)
    200

NotImplementedError: Supervised fastText models are not supported

model = FastText.load_fasttext_format('model.model.vectors.npy')

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
in ()
----> 1 model = FastText.load_fasttext_format('model.model.vectors.npy')

5 frames
/usr/local/lib/python3.10/dist-packages/gensim/models/_fasttext_bin.py in _load_vocab(fin, new_format, encoding)
    196     # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
    197     if nlabels > 0:
--> 198         raise NotImplementedError("Supervised fastText models are not supported")
    199     logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)
    200

NotImplementedError: Supervised fastText models are not supported

model = FastText.load_facebook_model('model.model')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in ()
----> 1 model = FastText.load_facebook_model('model.model.vectors.npy')

AttributeError: type object 'FastText' has no attribute 'load_facebook_model'

Should I try datapath from gensim.test.utils, api.load from gensim.downloader, or pip install fasttext instead of gensim's FastText?
How can I download any pretrained fastText model from rusvectores?
akutuzov commented 5 months ago

Hi,

You should have all the files from the archive in one directory, not only model.model and model.model.vectors.npy.

If this is done, gensim.models.KeyedVectors.load() works just fine with this model:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import gensim

In [2]: model = gensim.models.KeyedVectors.load("model.model")

In [3]: word = "кракозябра"

In [4]: model.most_similar(word)
Out[4]: 
[('краков', 0.6272931694984436),
 ('припорашивать', 0.5428630709648132),
 ('крак', 0.5345099568367004),
 ('краковский', 0.529658317565918),
 ('распуститься', 0.528093159198761),
 ('припорошать', 0.515566885471344),
 ('вроцлав', 0.5138404965400696),
 ('капустный', 0.5137609839439392),
 ('павлиний', 0.512362539768219),
 ('ягель', 0.5122756958007812)]

In [5]: model.most_similar("волк")
Out[5]: 
[('медведь', 0.7839906215667725),
 ('зверь', 0.7489554286003113),
 ('лисица', 0.7402448654174805),
 ('волчица', 0.7251183390617371),
 ('заяц', 0.7193619012832642),
 ('лис', 0.7154371738433838),
 ('волчонок', 0.7136003971099854),
 ('олень', 0.7099077105522156),
 ('шакал', 0.7061660885810852),
 ('лось', 0.7053733468055725)]
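
For a Colab-style workflow, the steps above can be sketched roughly as follows. This is only an illustration, not part of webvectors: the function names and the target directory are hypothetical, the URL is the one quoted in the question, and it assumes the archive keeps `model.model` and its companion `.npy` files at the top level, as described in this thread.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL quoted in the question above.
MODEL_URL = "http://vectors.nlpl.eu/repository/20/213.zip"


def extract_archive(zip_path, target_dir):
    """Unpack every member of the archive into target_dir and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        # Extract all files, not just model.model: the .npy files must sit next to it.
        archive.extractall(target)
        names = archive.namelist()
    return [target / name for name in names]


def download_and_extract(url, target_dir):
    """Download the zip next to target_dir, then unpack it in full (hypothetical helper)."""
    target = Path(target_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    zip_path = target.parent / (target.name + ".zip")
    urllib.request.urlretrieve(url, zip_path)
    return extract_archive(zip_path, target)


def load_vectors(target_dir):
    """Load model.model from a directory that holds the whole unpacked archive."""
    from gensim.models import KeyedVectors  # imported lazily; requires gensim

    # KeyedVectors.load() takes a path string and looks for the arrays stored
    # alongside it (e.g. model.model.vectors_vocab.npy) in the same directory.
    return KeyedVectors.load(str(Path(target_dir) / "model.model"))
```

With these helpers, something like `download_and_extract(MODEL_URL, "geowac")` followed by `model = load_vectors("geowac")` should give an object on which `model.most_similar("волк")` can be called, as in the session above.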