avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

Can't compress fasttext when loaded from facebook format directly #10

Closed robinp closed 2 years ago

robinp commented 2 years ago

Hello again! I'm not sure whether this is a compress-fasttext problem or a gensim problem, but here we go:

I'm getting the following traceback:

Traceback (most recent call last):  
  File "/src/makesmall.py", line 11, in <module>
    small_model = compress_fasttext.prune_ft_freq(orig)
  File "/usr/local/lib/python3.9/site-packages/compress_fasttext/compress.py", line 206, in prune_ft_freq
    ngram_norms = np.linalg.norm(ft.vectors_ngrams, axis=-1)
AttributeError: 'FastText' object has no attribute 'vectors_ngrams'

for this code:

import sys
from gensim.models import fasttext
from gensim.test.utils import datapath
import compress_fasttext

[inpath, outpath] = sys.argv[1:3]
print("Loading original from", inpath)
orig = fasttext.load_facebook_model(datapath(inpath))

print("Compressing")
small_model = compress_fasttext.prune_ft_freq(orig)

print("Saving compressed to", outpath)
small_model.save(outpath)

but when I round-trip the model through gensim's save & load, it works:

import sys
from gensim.models import fasttext
from gensim.test.utils import datapath
import compress_fasttext

[inpath, outpath] = sys.argv[1:3]
print("Loading biginal from", inpath)
big = fasttext.load_facebook_model(datapath(inpath))

# Note: round-tripping the model through gensim's on-disk format; otherwise compression doesn't work.
print("Saving back in gensim format")
big.wv.save(outpath + ".gensim")

print("Loading gensim")
big = fasttext.FastTextKeyedVectors.load(outpath + ".gensim")

print("Compressing")
small_model = compress_fasttext.prune_ft_freq(big)

print("Saving compressed to", outpath + ".compressed")

env:

gensim == 4.1.2
compress-fasttext == 0.1.2

Thank you!

avidale commented 2 years ago

In Gensim, there are two kinds of fastText models: trainable models (FastText) and trained models (FastTextKeyedVectors). A trained model initially lives inside a trainable model as its wv property, but it can be saved and used separately. compress-fasttext works only with trained models.

When you load from the Facebook format, you get a trainable model by default, so you need to extract the trained model from it. This means that in your code you have to replace


small_model = compress_fasttext.prune_ft_freq(orig)

with

small_model = compress_fasttext.prune_ft_freq(orig.wv)

and it will work.
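
For completeness, your whole first script with this one-line fix applied would look like this (same arguments and paths as in your original snippet):

import sys
from gensim.models import fasttext
from gensim.test.utils import datapath
import compress_fasttext

[inpath, outpath] = sys.argv[1:3]
print("Loading original from", inpath)
orig = fasttext.load_facebook_model(datapath(inpath))

print("Compressing")
# pass the trained FastTextKeyedVectors (orig.wv), not the trainable FastText model itself
small_model = compress_fasttext.prune_ft_freq(orig.wv)

print("Saving compressed to", outpath)
small_model.save(outpath)

With that, the intermediate round trip through the gensim format from your second script is no longer needed.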

robinp commented 2 years ago

Oh, what a silly mistake, thank you! I indeed see the use of .wv in the saving code.