avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

compress_fasttext 0.0.7 doesn't work with gensim 3.7.2 #16

Closed mglowacki100 closed 2 years ago

mglowacki100 commented 2 years ago

I tried to compress a gensim 3.7.2 fastText model with compress_fasttext 0.0.7:

import gensim
import compress_fasttext

big_model = gensim.models.fasttext.FastTextKeyedVectors.load('path-to-original-model')
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True) #ERROR
small_model.save('path-to-new-model')

I got this error when calling prune_ft_freq:

AttributeError: 'FastText' object has no attribute 'vectors_ngrams'

Alternatively, with the prune_ft function:

AttributeError: 'FastText' object has no attribute 'vocab'

Is gensim 3.7.2 too old, or am I missing something? Was there perhaps a version of compress_fasttext that supported it?

avidale commented 2 years ago

@mglowacki100 could you please share a link to the original model so that I could reproduce this problem?

In general, compress_fasttext 0.0.7 is expected to work with gensim 3.7.2.

mglowacki100 commented 2 years ago

@avidale Thanks for the fast reply. Here are the steps to reproduce the issue in Google Colab: https://colab.research.google.com/gist/mglowacki100/1b018bab65199fdd6060204802d60de7/compress_ft_gensim.ipynb The script is based on https://github.com/RaRe-Technologies/gensim/releases/3.6.0 and compress_fasttext.

avidale commented 2 years ago

The error that you got is due to the difference between the FastText and FastTextKeyedVectors classes in Gensim. The former includes the latter along with some additional information used only for training the model. The compress_fasttext package works only with the latter.

After running

ft_model = FastText(corpus_file=corpus_fname, workers=-1)
ft_model.save('ft.model')
big_model = gensim.models.fasttext.FastTextKeyedVectors.load('ft.model')

big_model is a FastText object rather than a FastTextKeyedVectors object (which I find very confusing). To get the keyed vectors, you should access its .wv property:

print(type(big_model))  # gensim.models.fasttext.FastText
print(type(big_model.wv))  # gensim.models.keyedvectors.FastTextKeyedVectors

Thus, in order to make compress_fasttext work, please just replace

small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)

with

small_model = compress_fasttext.prune_ft_freq(big_model.wv, pq=True)

and it should be OK.
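To guard against this mix-up in general, one option is a small wrapper that unwraps `.wv` when it is present. The following is only a minimal sketch: `as_keyed_vectors` is a hypothetical helper name, not part of compress_fasttext or gensim, and the two stub classes merely stand in for gensim's FastText and FastTextKeyedVectors so that the snippet runs without gensim installed.

```python
def as_keyed_vectors(model):
    """Return model.wv if the object has a .wv attribute, else the object itself.

    Hypothetical helper (an assumption, not a library function): a full
    training model such as gensim's FastText keeps its vectors under .wv,
    while an object without .wv is passed through unchanged.
    """
    return getattr(model, "wv", model)


# Stand-in classes used only for illustration; they are NOT gensim classes.
class StubKeyedVectors:
    pass


class StubFastText:
    def __init__(self):
        self.wv = StubKeyedVectors()


full_model = StubFastText()
vectors_only = StubKeyedVectors()

# The full model is unwrapped to its .wv; the bare vectors pass through.
print(type(as_keyed_vectors(full_model)).__name__)
print(type(as_keyed_vectors(vectors_only)).__name__)
```

With such a helper, the call from the issue could presumably be written as `compress_fasttext.prune_ft_freq(as_keyed_vectors(big_model), pq=True)`, which would work whether `big_model` was loaded as a full model or as keyed vectors.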

mglowacki100 commented 2 years ago

Thank you @avidale, it solved my issue.