avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

Problem loading back the saved FastTextKeyedVectors #8

Closed robinp closed 2 years ago

robinp commented 2 years ago

Hello! I tried to compress a fastText model and then load the saved gensim model back. Loading fails with this exception:

Python 3.9.7 (default, Sep 10 2021, 14:59:43) 
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim
>>> sm = gensim.models.fasttext.FastTextKeyedVectors.load('/root/py/train/eng-small')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 995, in load
    return super(FastTextKeyedVectors, cls).load(fname_or_handle, **kwargs)
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/utils.py", line 487, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1019, in _load_specials
    self.adjust_vectors()  # recompose full-word vectors
  File "/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1177, in adjust_vectors
    self.vectors = self.vectors_vocab[:].copy()
TypeError: 'NoneType' object is not subscriptable

Note: saw this warning while compressing:

/root/.local/share/virtualenvs/py-np11W-p9/lib/python3.9/site-packages/scipy/cluster/vq.py:607: UserWarning: One of the clusters is empty. Re-run kmeans with a different initialization.
  warnings.warn("One of the clusters is empty. "

However, I reran the compression, and in a run where the warning was not printed the issue still occurred.

Pipfile:

...
[packages]
gensim = "==4.1.2"
compress-fasttext = "==0.1.1"
pqkmeans = "*"
python-Levenshtein = "*"
...

The same error also occurs with gensim==4.0.0.

Thank you!

robinp commented 2 years ago

It seems the unpickled object doesn't have the vectors field, which is why adjust_vectors is called. That method then tries to read vectors_vocab, which is None because the code at https://github.com/avidale/compress-fasttext/blob/master/compress_fasttext/compress.py#L27 never set it.

Why could that field be missing when unpickling? It is there on the model before it is saved.
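One way a field can be present before saving and gone after loading is a save routine that deliberately skips it. The sketch below uses a toy class to show that mechanism. It is not gensim's code: ToyKeyedVectors, IGNORED_ON_SAVE, save_bytes and load_bytes are hypothetical names made up for illustration.

```python
import pickle


class ToyKeyedVectors:
    """Toy stand-in for illustration only; NOT gensim's FastTextKeyedVectors."""

    # Hypothetical list of attributes that are dropped when saving.
    IGNORED_ON_SAVE = ("vectors",)

    def __init__(self, vectors, vectors_vocab=None):
        self.vectors = vectors
        self.vectors_vocab = vectors_vocab

    def save_bytes(self):
        # Serialize every attribute except the ignored ones.
        state = {k: v for k, v in self.__dict__.items()
                 if k not in self.IGNORED_ON_SAVE}
        return pickle.dumps(state)

    @classmethod
    def load_bytes(cls, payload):
        obj = cls.__new__(cls)
        obj.__dict__.update(pickle.loads(payload))
        if not hasattr(obj, "vectors"):
            # The skipped field is rebuilt from vectors_vocab on load.
            obj.vectors = list(obj.vectors_vocab[:])
        return obj


# A model that keeps only `vectors` and leaves `vectors_vocab` unset.
compressed_like = ToyKeyedVectors(vectors=[[0.1, 0.2]], vectors_vocab=None)
payload = compressed_like.save_bytes()

try:
    ToyKeyedVectors.load_bytes(payload)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

In this toy setup the object has vectors before saving, loses it in the saved payload, and then fails on load because the rebuild step reads a vectors_vocab that was never set.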

avidale commented 2 years ago

Hello! Could you please provide a complete code snippet with loading the full model, compressing it, saving the small model and loading it? If I could reproduce the problem, it would be much easier to solve it.

robinp commented 2 years ago

Hm, https://github.com/RaRe-Technologies/gensim/blob/4.0.0/gensim/models/fasttext.py#L1072 seems to ignore "vectors" on saving. But then how could this work? Or maybe no one has tried to load it back yet.

Re example, yeah, missed it, sorry:

from gensim.models import fasttext
from gensim.test.utils import datapath
import compress_fasttext

""" original to gensim - can skip
print("Loading")
big_model = fasttext.load_facebook_model(datapath("/root/py/train/eng.bin"))

print("Saving back")
big_model.wv.save("/root/py/train/orig.gensim")
"""
print("Load gensim vecs")
loaded = fasttext.FastTextKeyedVectors.load("/root/py/train/orig.gensim")

print("Compressing")
small_model = compress_fasttext.prune_ft_freq(loaded)

print("Saving")
small_model.save('/root/py/train/eng-small2')

print("Load back saved")
sm = fasttext.FastTextKeyedVectors.load('/root/py/train/eng-small2')

avidale commented 2 years ago

Thanks, I think I got it! The old Gensim models had two equivalent attributes, vectors and vectors_vocab (vectors is calculated from vectors_vocab and vectors_ngrams). This is obviously redundant, so I kept only vectors in the compressed model. In the newer Gensim, its developers resolved the redundancy the other way: they decided to save only vectors_vocab and to recompute vectors each time the model is loaded.
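To make the relation between the three attributes concrete, here is a rough sketch in plain Python. It assumes each full word vector is the average of the word's own entry in vectors_vocab and the vectors_ngrams rows of its n-gram buckets. The function name recompose_vectors, the bucket lists and all the numbers are made up for illustration, and the real Gensim code works on NumPy arrays.

```python
def recompose_vectors(vectors_vocab, vectors_ngrams, buckets_word):
    """Rebuild full word vectors from per-word and per-n-gram vectors.

    Assumption: the full vector is the mean of the word's own vector
    and the vectors of all its n-gram buckets.
    """
    full = []
    for word_vec, buckets in zip(vectors_vocab, buckets_word):
        total = list(word_vec)
        for bucket in buckets:
            total = [t + g for t, g in zip(total, vectors_ngrams[bucket])]
        full.append([t / (len(buckets) + 1) for t in total])
    return full


# Toy data: 2 words, 3 n-gram buckets, 2 dimensions.
vectors_vocab = [[1.0, 0.0], [0.0, 2.0]]
vectors_ngrams = [[1.0, 1.0], [3.0, 0.0], [0.0, 4.0]]
buckets_word = [[0, 1], [2]]  # n-gram bucket indices for each word

vectors = recompose_vectors(vectors_vocab, vectors_ngrams, buckets_word)
print(vectors)
```

The loop touches every word and every one of its buckets, which is where the load-time cost of recomputing vectors comes from.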

I don't want to store both vectors and vectors_vocab, as in the old Gensim (because it takes disk space). But I also don't want to recompute vectors each time the model loads (because it takes CPU and makes the model load slower).

I will think carefully about how to resolve this. Maybe I will just override _save_specials. Suggestions are welcome.
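The override idea might look roughly like the sketch below. It uses toy classes rather than Gensim's real hooks, so ToyBase, ToyCompressed, _ignored_on_save and _after_load are all hypothetical names, and this is not necessarily how the package ends up handling it.

```python
import pickle


class ToyBase:
    """Hypothetical base class: drops `vectors` on save, rebuilds it on load."""

    def _ignored_on_save(self):
        return {"vectors"}

    def save_bytes(self):
        skip = self._ignored_on_save()
        state = {k: v for k, v in self.__dict__.items() if k not in skip}
        return pickle.dumps(state)

    @classmethod
    def load_bytes(cls, payload):
        obj = cls.__new__(cls)
        obj.__dict__.update(pickle.loads(payload))
        obj._after_load()
        return obj

    def _after_load(self):
        # Base behaviour: always rebuild `vectors` from `vectors_vocab`.
        self.vectors = [list(row) for row in self.vectors_vocab]


class ToyCompressed(ToyBase):
    """Subclass that stores `vectors` directly and skips the rebuild."""

    def _ignored_on_save(self):
        return set()  # keep `vectors` in the saved state

    def _after_load(self):
        pass  # nothing to recompute, `vectors` was saved as-is


model = ToyCompressed()
model.vectors = [[0.5, 0.25]]
model.vectors_vocab = None  # only one copy of the vectors is stored

restored = ToyCompressed.load_bytes(model.save_bytes())
print(restored.vectors)
```

With this shape only one copy of the vectors is written to disk and loading does no recomputation, but the saved file has to be loaded through the subclass rather than the base class.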

avidale commented 2 years ago

@robinp, I have updated the package so that the models are saved and loaded correctly.

Please update it to compress-fasttext>=0.1.2 and check that the problem is gone. You need to replace the line

sm = fasttext.FastTextKeyedVectors.load('/root/py/train/eng-small2')

with

sm = compress_fasttext.CompressedFastTextKeyedVectors.load('/root/py/train/eng-small2')

because compressed models use optimizations that are not present in FastTextKeyedVectors (or in gensim in general).

robinp commented 2 years ago

Works like a charm, thank you!