avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

Reproduce and Compress ft_cc.en.300_freqprune_50K_5K_pq_100 #22

Closed: HossamAmer12 closed this issue 9 months ago

HossamAmer12 commented 9 months ago

I am trying to reproduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model from the original fastText model.

This is my code:


import fasttext.util
import compress_fasttext
from gensim.models.fasttext import load_facebook_model

org_model_path = 'cc.en.300.bin'
print(fasttext.util.download_model('en', if_exists='ignore'))  # fetches cc.en.300.bin if absent, prints its path
ft = load_facebook_model(org_model_path).wv

small_model = compress_fasttext.prune_ft_freq(
    ft,
    new_vocab_size=5_000,       # keep the 5,000 most frequent words
    new_ngrams_size=2_000_000,  # keep 2,000,000 ngram buckets
    fp16=False,
    pq=True,                    # product quantization
    qdim=100,
    centroids=255,
    prune_by_norm=True,
    norm_power=1,
)

The generated model (143 MB) and the posted model (12 MB) differ greatly in size. Can you please point out what's missing?

HossamAmer12 commented 9 months ago

I set new_ngrams_size to 50K, but I got a smaller model (6.9 MB) than the one already posted (12 MB).
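
For reference, a rough back-of-envelope estimate is consistent with these sizes, assuming product quantization stores each vector as qdim one-byte centroid codes (centroids=255 fits in a uint8); codebooks, word strings, and counts add some overhead, and the real on-disk layout may differ:

qdim = 100
mb = 1_000_000

rows_small = 50_000 + 5_000     # ngram buckets + vocabulary words
print(rows_small * qdim / mb)   # ~5.5 MB, close to the observed 6.9 MB

rows_large = 2_000_000 + 5_000
print(rows_large * qdim / mb)   # ~200 MB, the same order as the observed 143 MB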

The model I got is still a good model. However, it'd be great if anybody could let me know how to reduce the number of components to 100 before prune_ft_freq.

I tried svd_ft(ft, n_components=100) and then prune_ft_freq, but it raises an error on the line below:

sorted_vocab = sorted(ft.key_to_index.items(), key=lambda x: ft.get_vecattr(x[0], 'count'), reverse=True)

It says that it cannot get the count attribute.
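
A minimal sketch of the failure, assuming the object returned by svd_ft exposes the usual gensim KeyedVectors interface (the variable names are just for illustration):

# Hypothetical diagnostic, not a fix: check whether the per-word 'count'
# metadata survives svd_ft; prune_ft_freq sorts the vocabulary by it.
reduced = compress_fasttext.svd_ft(ft, n_components=100)
word = next(iter(ft.key_to_index))
print(ft.get_vecattr(word, 'count'))       # works on the original vectors
print(reduced.get_vecattr(word, 'count'))  # fails if 'count' was dropped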

avidale commented 9 months ago

The generated model (143MB) and the posted model (12MB) sizes are different. Can you please point out what's missing?

Here is the code that I used to create the published compressed English model (you can run it in this Colab notebook):

from gensim.models.fasttext import load_facebook_model
import compress_fasttext

big_model = load_facebook_model('cc.en.300.bin').wv
big_model.adjust_vectors()  # make word vectors reflect vocab + ngram vectors
new_ngrams_size, new_vocab_size, qdim = 50_000, 5_000, 100
small_model = compress_fasttext.prune_ft_freq(
    big_model, pq=True, new_vocab_size=new_vocab_size, new_ngrams_size=new_ngrams_size, qdim=qdim, centroids=255
)
mn = 'ft_en_freqprune_{}K_{}K_pq_{}.bin'.format(int(new_ngrams_size / 1000), int(new_vocab_size / 1000), qdim)
print(mn)
small_model.save(mn)

When I ran this code today, it also produced a compressed model different from the posted one. My guess is that the mismatch is caused by drift in gensim versions; I compressed the posted model with gensim==4.0.0, but I can no longer install that version today.

Nevertheless, I consider the reproduction more-or-less successful, because the newly compressed model behaves similarly (although not identically) to the posted one:

small_model2 = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/gensim-4-draft/ft_cc.en.300_freqprune_50K_5K_pq_100.bin'
)
print(small_model.most_similar('cat'))
# [('cats', 0.8045504962895952), ('dog', 0.6709342746267867), ('pet', 0.6361820379926278), ...
print(small_model2.most_similar('cat'))
# [('cats', 0.8047200524924056), ('dog', 0.6737426243477002), ('pet', 0.6418062262383539), ...
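
To quantify "behaves similarly" beyond a single most_similar query, a quick sketch along these lines compares the two models word by word (the probe words are arbitrary, chosen only for illustration):

import numpy as np

# Cosine similarity between the two models' vectors for a few probe words.
for word in ['cat', 'dog', 'house', 'running']:
    v1, v2 = small_model[word], small_model2[word]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(word, round(float(cos), 3))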

avidale commented 9 months ago

The model I got is still a good model. However, It'd be great if anybody could let me know how to reduce the components to 100 before prune_ft_freq.

compress-fasttext doesn't work this way. It allows you to apply either product quantization or SVD to a model, but not both at the same time.
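
In other words, the two compression paths are separate entry points. A minimal sketch, reusing the parameter values already shown in this thread:

import compress_fasttext

# Option A: frequency pruning + product quantization (what prune_ft_freq
# with pq=True does); this is how the posted model was built.
pq_model = compress_fasttext.prune_ft_freq(
    ft, pq=True, new_vocab_size=5_000, new_ngrams_size=50_000, qdim=100, centroids=255
)

# Option B: SVD dimensionality reduction, as a standalone alternative.
svd_model = compress_fasttext.svd_ft(ft, n_components=100)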

HossamAmer12 commented 9 months ago

Thanks @avidale for your responses - that helps :))