avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

Revert the compressed vectors to gensim format #21

Closed. HossamAmer12 closed this issue 3 months ago.

HossamAmer12 commented 9 months ago

I am using this pre-trained model: ft_cc.en.300_freqprune_50K_5K_pq_100.bin

That's my code:

import compress_fasttext
# DecomposedMatrix ships with compress_fasttext; the exact module path depends on the version
from compress_fasttext.compress import DecomposedMatrix

ft_gensim = compress_fasttext.models.CompressedFastTextKeyedVectors.load(org_model_path)
new_vocab = ft_gensim.key_to_index
new_vectors = ft_gensim.vectors
new_ngrams = ft_gensim.vectors_ngrams

print(type(new_vectors))  # <class 'compress_fasttext.navec_like.PQ'>
print(type(new_ngrams))   # <class 'compress_fasttext.prune.RowSparseMatrix'>
new_vectors = DecomposedMatrix.compress(new_vectors, n_components=100, fp16=True)
new_ngrams = DecomposedMatrix.compress(new_ngrams, n_components=100, fp16=True)

I get this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.

Is there a way to convert the vectors and ngrams back to gensim format so I can run this compress operation?

HossamAmer12 commented 9 months ago

I believe this should be the solution:

new_vocab = ft_gensim.key_to_index
new_vectors = ft_gensim.vectors.unpack()        # PQ -> dense numpy array
new_ngrams = ft_gensim.vectors_ngrams.unpack()  # RowSparseMatrix -> dense numpy array

That being said, this code increases the size of the original model, because the matrices have to be unpacked before the SVD, so the final model is no longer product-quantized. :(
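For reference, the full round trip I am attempting looks roughly like this (a sketch: the DecomposedMatrix import path is a guess, and I have not verified that the compressed matrices can simply be assigned back):

import compress_fasttext
from compress_fasttext.compress import DecomposedMatrix  # module path may differ by version

ft_gensim = compress_fasttext.models.CompressedFastTextKeyedVectors.load(org_model_path)

# Decompress PQ / RowSparseMatrix to dense numpy arrays, then re-compress with SVD
ft_gensim.vectors = DecomposedMatrix.compress(
    ft_gensim.vectors.unpack(), n_components=100, fp16=True)
ft_gensim.vectors_ngrams = DecomposedMatrix.compress(
    ft_gensim.vectors_ngrams.unpack(), n_components=100, fp16=True)
ft_gensim.save(new_model_path)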

avidale commented 9 months ago

Can you please explain once more what final goal you want to achieve, and what you would want the solution to look like?

HossamAmer12 commented 9 months ago

My original goal is: (1) take any language model from here and compress it down to 2-3 MB using prune_ft_freq; (2) use this model to implement the word/sentence look-up without external dependencies.

Since compress-fasttext is not building for me [PQ dependency], I am trying to use the posted ft_cc.en.300_freqprune_50K_5K_pq_100.bin and decrease the dimensions to 100 (or 150), so that I end up with a 2-3 MB fastText model.

Then I can worry about (2) above later. Of course, if you have pointers, that'd be great. For example, the hashing function used in the compress-fasttext lookup is not clear to me.

avidale commented 9 months ago

implement the word/sentence look-up without external dependencies

What do you mean by "without external dependencies"? You want to do the lookup in pure numpy, without gensim and compress_fasttext packages?

avidale commented 9 months ago

Of course, if you have pointers, that'd be great. For example, the hashing function used in the compress-fasttext lookup is not clear to me.

What kind of pointers do you need? And why do you want to mess with the hashing function?

HossamAmer12 commented 9 months ago

implement the word/sentence look-up without external dependencies

What do you mean by "without external dependencies"? You want to do the lookup in pure numpy, without gensim and compress_fasttext packages?

Yes, that's right: in pure numpy, similar to what's already done here. If you could point out the differences, that'd be appreciated.
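For concreteness, this is the kind of pure-numpy lookup I have in mind (a rough sketch: all names are mine, minn=maxn=5 and 5000 buckets follow the cc.en.300 / 5K-ngram conventions, and the hash mirrors fastText's FNV-1a with the signed-char cast):

import numpy as np

def ft_hash(ngram: str) -> int:
    # FNV-1a over utf-8 bytes, as in fastText; bytes > 127 are treated as negative int8
    h = 2166136261
    for b in ngram.encode('utf-8'):
        if b >= 128:
            b -= 256                      # emulate C's signed-char cast
        h ^= b & 0xFFFFFFFF               # sign-extend to uint32 before xor
        h = (h * 16777619) & 0xFFFFFFFF   # multiply modulo 2**32
    return h

def char_ngrams(word: str, minn: int = 5, maxn: int = 5):
    w = f'<{word}>'                       # fastText pads words with angle brackets
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word, vocab, vectors, ngram_vectors, num_buckets=5000):
    if word in vocab:                     # in-vocabulary: return the stored row
        return vectors[vocab[word]]
    rows = [ft_hash(ng) % num_buckets for ng in char_ngrams(word)]
    return np.mean(ngram_vectors[rows], axis=0)  # OOV: mean of hashed ngram rows

Here vocab is the key_to_index dict, and vectors / ngram_vectors are the dense arrays obtained via .unpack() above.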

HossamAmer12 commented 9 months ago

Of course, if you have pointers, that'd be great. For example, the hashing function used in the compress-fasttext lookup is not clear to me.

What kind of pointers do you need? And why do you want to mess with the hashing function?

Requested pointers:

1. Can you help me reduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model from 300 dimensions to 100?
2. Can you help me reproduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model? What are the steps to compress it? compress-fasttext is not working for me.

As for the hash function: I do not wish to mess with it, but I want to know which hash function you are using. Can you provide a pointer to its code?

Appreciate your responses :))

avidale commented 9 months ago

1. Can you help me reduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model from 300 dimensions to 100?

This model already has internal product-quantized vectors in 100 dimensions, just as its name suggests.
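A quick way to double-check the stored shapes is to unpack the matrices, as in the snippets above:

import compress_fasttext

ft = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'ft_cc.en.300_freqprune_50K_5K_pq_100.bin')
print(ft.vectors.unpack().shape)         # decompressed vocab matrix: (rows, dims)
print(ft.vectors_ngrams.unpack().shape)  # decompressed ngram matrix: (buckets, dims)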

avidale commented 9 months ago

2. Can you help me reproduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model? What are the steps to compress it? compress-fasttext is not working for me.

I produced it with compress-fasttext. If it is not working for you, please describe exactly how to reproduce your problem, and I will fix it.
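If it helps, the recipe follows the README; roughly something like this (a sketch: the keyword names new_vocab_size, new_ngrams_size, pq and qdim are my best guess at the API, chosen to match the model name):

import gensim
import compress_fasttext

# Load the original, uncompressed model (load_facebook_model reads Facebook's .bin format)
big_model = gensim.models.fasttext.load_facebook_model('cc.en.300.bin').wv

# Prune by frequency, then product-quantize
small_model = compress_fasttext.prune_ft_freq(
    big_model,
    new_vocab_size=50_000,   # "50K" in the model name
    new_ngrams_size=5_000,   # "5K" in the model name
    pq=True,                 # enable product quantization
    qdim=100,                # "pq_100" in the model name
)
small_model.save('ft_cc.en.300_freqprune_50K_5K_pq_100.bin')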

avidale commented 9 months ago

I do not wish to mess with it, but I want to know which hash function you are using. Can you provide a pointer to its code?

The function is called ft_ngram_hashes, and here I import it from gensim (I try two different paths, because the hash function lives in different places in different versions of gensim).
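The fallback looks something like this (the exact module paths differ between gensim 3.x and 4.x, so treat them as an approximation):

try:
    from gensim.models.utils_any2vec import ft_ngram_hashes   # older gensim
except ImportError:
    from gensim.models.fasttext import ft_ngram_hashes        # newer gensim

# hashes (already taken modulo the bucket count) of all character ngrams of "hello"
print(ft_ngram_hashes('hello', 5, 5, 5000))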