avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

Compress FastText to Uploaded Models #18

Closed HossamAmer12 closed 11 months ago

HossamAmer12 commented 1 year ago

```python
import gensim
import compress_fasttext

model_path = "./models/en/ft_cc.en.300_freqprune_50K_5K_pq_100.bin"

big_model = gensim.models.fasttext.FastTextKeyedVectors.load(model_path)

small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
```

Why does running compress-fasttext on one of the already uploaded models not work?

It gives this error:

TypeError: loop of ufunc does not support argument 0 of type RowSparseMatrix which has no callable conjugate method

avidale commented 12 months ago

Because gensim.models.fasttext.FastTextKeyedVectors.load can work only with full models, not with compressed ones.

To load a compressed model, please use compress_fasttext.models.CompressedFastTextKeyedVectors.load instead.
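As a rough sketch of the two cases (not a definitive recipe): the first helper loads an already compressed model with the class named above, and the second compresses a full model. The helper names and path arguments are hypothetical; the second helper assumes the full model is an original Facebook `.bin` file that gensim's `load_facebook_model` can read, and it reuses the `prune_ft_freq(big_model, pq=True)` call from the question with its default settings.

```python
def load_compressed(path):
    """Load a model that was already compressed with compress-fasttext."""
    # Imported inside the function so this sketch can be pasted and inspected
    # even where compress-fasttext is not installed.
    import compress_fasttext

    return compress_fasttext.models.CompressedFastTextKeyedVectors.load(path)


def compress_full_model(full_model_path, output_path):
    """Compress a full (uncompressed) fastText model and save the result.

    Assumption: full_model_path points to an original Facebook fastText
    .bin file; both paths are placeholders chosen by the caller.
    """
    import compress_fasttext
    from gensim.models.fasttext import load_facebook_model

    # .wv holds the keyed vectors of the full model.
    big_model = load_facebook_model(full_model_path).wv
    # Same call as in the question, with default pruning settings.
    small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
    small_model.save(output_path)
    return small_model
```

For the file from the question, this would be `load_compressed("./models/en/ft_cc.en.300_freqprune_50K_5K_pq_100.bin")`. Passing that same file to `compress_full_model` is not expected to work, because it is already compressed.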

HossamAmer12 commented 12 months ago

Thanks, @avidale. Can you please share the steps for reproducing this model: ft_cc.en.300_freqprune_50K_5K_pq_100.bin?

A couple more questions, please:

1- What to do if I want to get the full embedding matrix from the CompressedFTKeys?

2- Any difference between your Compressed implementation and normal fast text implementation (no gensim)? In terms of getting the word vector or anything else? [Referring to this link]

avidale commented 11 months ago

> 1- What to do if I want to get the full embedding matrix from the CompressedFTKeys?

I am not sure that I understand what a "full embedding matrix" is. There is a matrix of embeddings of individual n-grams, but each of them is meaningless on its own; they make sense only as parts of a word. There can also be a matrix of word embeddings, but it is incomplete, because the number of possible words is infinite, and for most of them the embeddings are computed on the fly.
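To illustrate what "computed on the fly" means, here is a toy sketch in plain Python. It is not the code of compress-fasttext, gensim, or Facebook fastText: the dimensions, the random n-gram matrix, and the function names are all made up for illustration, and it ignores the stored whole-word vectors that real models also use for in-vocabulary words. The idea it shows is only that a vector for any string can be built by averaging the rows of the n-gram matrix that its character n-grams hash to.

```python
import random

DIM = 4        # toy embedding size (real models use e.g. 300)
BUCKETS = 50   # toy number of n-gram rows (real models use far more)


def char_ngrams(word, min_n=3, max_n=5):
    """Character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def fnv1a(text):
    """Deterministic 32-bit FNV-1a hash, used here to pick a matrix row."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


# Stand-in for the trained n-gram embedding matrix (random toy values).
rng = random.Random(0)
ngram_matrix = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def word_vector(word):
    """Average the n-gram rows of a word; works for words never seen before."""
    rows = [ngram_matrix[fnv1a(g) % BUCKETS] for g in char_ngrams(word)]
    if not rows:
        return [0.0] * DIM
    return [sum(column) / len(rows) for column in zip(*rows)]
```

A matrix of word embeddings then exists only for whatever vocabulary you pick, for example `matrix = [word_vector(w) for w in ["cat", "cats", "unseenword"]]`, which is why it can never be "full".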

And anyway, this question doesn't seem to be relevant to the issue topic.

avidale commented 11 months ago

> 2- Any difference between your Compressed implementation and normal fast text implementation (no gensim)? In terms of getting the word vector or anything else?

I have no idea how the Facebook fasttext implementation (i.e. the one with "no gensim") works, and I don't guarantee any compatibility with it.