Closed: HossamAmer12 closed this issue 9 months ago
I set new_ngrams_size to 50K, but I got a smaller model (6.9 MB) than the one already posted (12 MB).
The model I got is still a good model. However, it'd be great if anybody could let me know how to reduce the components to 100 before prune_ft_freq.
I tried svd_ft(ft, n_components=100) and then prune_ft_freq, but it raises an error at the line below:
sorted_vocab = sorted(ft.key_to_index.items(), key=lambda x: ft.get_vecattr(x[0], 'count'), reverse=True)
It says that it cannot get the count attribute.
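For what it's worth, the failing line sorts the vocabulary by each word's 'count' attribute, so whatever model prune_ft_freq receives must still carry that attribute; the error suggests the SVD-reduced model no longer does. A stand-in sketch of the same sort, with plain dicts instead of gensim objects (all names below are toy stand-ins, not the real API objects):

```python
# Stand-in illustration, no gensim required: prune_ft_freq sorts the vocab by
# each word's 'count' attribute, so that attribute must survive any earlier
# transformation of the model.
key_to_index = {'the': 0, 'cat': 1, 'sat': 2}
counts = {'the': 1000, 'cat': 50, 'sat': 20}   # gensim keeps these as vecattrs

def get_vecattr(word, attr):
    # Mimics KeyedVectors.get_vecattr; raises KeyError when the attribute
    # was dropped, which is what the traceback above is complaining about.
    return {'count': counts}[attr][word]

sorted_vocab = sorted(key_to_index.items(),
                      key=lambda x: get_vecattr(x[0], 'count'),
                      reverse=True)
print(sorted_vocab)  # most frequent word first
```

If the reduced model really lost its counts, copying them back from the original model with set_vecattr before pruning might work, but I haven't tested that.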
Here is the code that I used to create the published compressed English model (you can run it in this Colab notebook):
from gensim.models.fasttext import load_facebook_model
import compress_fasttext
big_model = load_facebook_model('cc.en.300.bin').wv
big_model.adjust_vectors()
new_ngrams_size, new_vocab_size, qdim = 50_000, 5_000, 100
small_model = compress_fasttext.prune_ft_freq(
    big_model, pq=True, new_vocab_size=new_vocab_size, new_ngrams_size=new_ngrams_size, qdim=qdim, centroids=255
)
mn = 'ft_en_freqprune_{}K_{}K_pq_{}.bin'.format(int(new_ngrams_size/1000), int(new_vocab_size/1000), qdim)
print(mn)
small_model.save(mn)
When I ran this code today, it also produced a compressed model different from the posted one. My guess is that the discrepancy is caused by drift in gensim versions; I compressed the posted model with gensim==4.0.0, but now I can no longer install that version.
Nevertheless, I consider the reproduction more-or-less successful, because the newly compressed model behaves similarly (although not identically) to the posted one:
small_model2 = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/gensim-4-draft/ft_cc.en.300_freqprune_50K_5K_pq_100.bin'
)
print(small_model.most_similar('cat'))
# [('cats', 0.8045504962895952), ('dog', 0.6709342746267867), ('pet', 0.6361820379926278), ...
print(small_model2.most_similar('cat'))
# [('cats', 0.8047200524924056), ('dog', 0.6737426243477002), ('pet', 0.6418062262383539), ...
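One quick way to put a number on "behaves similarly" is the overlap of the two models' top-neighbor lists. A minimal sketch using only the neighbors printed above (truncated to three, so this is illustrative, not a real evaluation):

```python
# Toy agreement check between the two compressed models, using only the
# three top neighbors of 'cat' printed above for each model.
top_new    = ['cats', 'dog', 'pet']   # from small_model.most_similar('cat')
top_posted = ['cats', 'dog', 'pet']   # from small_model2.most_similar('cat')

# Jaccard overlap of the neighbor sets: 1.0 means identical top-k lists.
jaccard = len(set(top_new) & set(top_posted)) / len(set(top_new) | set(top_posted))
print(jaccard)
```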
> The model I got is still a good model. However, it'd be great if anybody could let me know how to reduce the components to 100 before prune_ft_freq.
compress-fasttext doesn't work this way. It allows you to apply either product quantization or SVD to a model, but not both at the same time.
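To make the distinction concrete, here is a minimal numpy sketch of the two alternative compression schemes applied to a toy embedding matrix. This is not compress-fasttext's actual implementation (the "quantizer" below just snaps subvectors to randomly chosen centroids, whereas real PQ runs k-means), only an illustration of why the two are separate options:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 300)).astype(np.float32)  # toy embedding matrix

# Option A: SVD. Keep a rank-100 factorization, i.e. store U*S (1000 x 100)
# and Vt (100 x 300) instead of the full 1000 x 300 matrix.
U, S, Vt = np.linalg.svd(emb, full_matrices=False)
k = 100
emb_svd = (U[:, :k] * S[:k]) @ Vt[:k]  # rank-k reconstruction, shape (1000, 300)

# Option B: product quantization. Split each 300-dim vector into 100 subvectors
# of dimension 3 and replace each subvector with the index of its nearest
# centroid, storing 1 byte per subvector plus the small codebooks.
n_sub, n_centroids, sub_dim = 100, 255, 3
sub = emb.reshape(1000, n_sub, sub_dim)
codes = np.empty((1000, n_sub), dtype=np.uint8)
codebooks = np.empty((n_sub, n_centroids, sub_dim), dtype=np.float32)
for j in range(n_sub):
    cent = sub[rng.choice(1000, n_centroids, replace=False), j]  # crude centroids
    dist = ((sub[:, j, None, :] - cent[None]) ** 2).sum(-1)      # (1000, 255)
    codes[:, j] = dist.argmin(1)
    codebooks[j] = cent

# Reconstruct from codes: look up each subvector's centroid and concatenate.
emb_pq = codebooks[np.arange(n_sub), codes].reshape(1000, 300)
```

Both produce a lossy approximation of the same matrix, just with different storage layouts (a dense low-rank factorization vs. byte codes plus codebooks), which is presumably why the library treats them as mutually exclusive.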
Thanks @avidale for your responses, that helps :))
I am trying to reproduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model from the original fastText model. This is my code:
The generated model (143MB) and the posted model (12MB) sizes are different. Can you please point out what's missing?