I believe this should be the solution:

```python
new_vocab = ft_gensim.key_to_index
new_vectors = ft_gensim.vectors.unpack()
new_ngrams = ft_gensim.vectors_ngrams.unpack()
```

That being said, this code increases the size of the original model, because the final model after SVD will be unpacked. :(
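For context, `ft_gensim` above is the compressed model, loaded along these lines (a sketch; the load call follows the compress-fasttext README):

```python
import compress_fasttext

# Load the published compressed model; its `vectors` and `vectors_ngrams`
# are quantized objects, which is why they need the unpack() calls above.
ft_gensim = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'ft_cc.en.300_freqprune_50K_5K_pq_100.bin')
```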
Can you please explain once more what final goal you want to achieve, and how you would want the solution to look?
My original goal is:
(1) Take any language model from here and compress it down to 2-3 MB using `prune_ft_freq`.
(2) Use this model and implement the word/sentence lookup without external dependencies.

Since compress-fasttext is not building for me [PQ dependency], I am trying to use the posted `ft_cc.en.300_freqprune_50K_5K_pq_100.bin` and decrease the dimensions to 100 (or 150), so that I have a 2-3 MB fastText model. Then I can worry about (2) above later. Of course, if you have pointers, that'd be great. For example, the hashing function is not clear in the compress-fasttext lookup.
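(For reference, the recipe I am trying to run is essentially the one from the compress-fasttext README, sketched here; it assumes the original `cc.en.300.bin` downloaded from the fastText site:)

```python
import gensim
import compress_fasttext

# Load an original 300-dim fastText model in Facebook's .bin format...
big_model = gensim.models.fasttext.load_facebook_model('cc.en.300.bin').wv
# ...and shrink it by pruning vocabulary/ngrams and quantizing the rest.
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('compressed_model.bin')
```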
> implement the word/sentence lookup without external dependencies

What do you mean by "without external dependencies"? You want to do the lookup in pure `numpy`, without the `gensim` and `compress_fasttext` packages?
> Of course, if you have pointers, that'd be great. For example, the hashing function is not clear in the compress-fasttext lookup.

What kind of pointers do you need? And why do you want to mess with the hashing function?
> > implement the word/sentence lookup without external dependencies
>
> What do you mean by "without external dependencies"? You want to do the lookup in pure `numpy`, without the `gensim` and `compress_fasttext` packages?
Yes, that's right: in pure `numpy`, similar to what's already done here. If you could point out the differences, that'd be appreciated.
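For concreteness, this is roughly the lookup I have in mind (a sketch; it assumes the arrays are already unpacked into plain numpy, that OOV vectors are the mean of ngram vectors as in gensim, and that `ngram_hashes` is some function returning the bucket indices of a word's character ngrams):

```python
import numpy as np

def lookup(word, key_to_index, vectors, vectors_ngrams, ngram_hashes):
    """Return the vector for `word` using only numpy arrays and a dict."""
    idx = key_to_index.get(word)
    if idx is not None:
        return vectors[idx]                       # in-vocabulary word
    hashes = ngram_hashes(word)                   # bucket indices of ngrams
    if not hashes:
        return np.zeros(vectors.shape[1], dtype=vectors.dtype)
    return vectors_ngrams[hashes].mean(axis=0)    # OOV: mean of ngram vectors
```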
> > Of course, if you have pointers, that'd be great. For example, the hashing function is not clear in the compress-fasttext lookup.
>
> What kind of pointers do you need? And why do you want to mess with the hashing function?
Requested pointers:
1- Can you help me narrow down the `ft_cc.en.300_freqprune_50K_5K_pq_100.bin` model from 300 dimensions to 100?
2- Can you help me reproduce the `ft_cc.en.300_freqprune_50K_5K_pq_100.bin` model? What are the steps to compress? compress-fasttext is not working for me.

For the hash function: I do not wish to mess with the function, but I want to know which hash function you are using. Can you provide a pointer to its code?

Appreciate your responses :))
> 1- Can you help me narrow down the `ft_cc.en.300_freqprune_50K_5K_pq_100.bin` model from 300 dimensions to 100?

This model already has internal product-quantized vectors in 100 dimensions, just as its name says.
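You can check this yourself after loading it (a sketch, assuming the file is available locally):

```python
import compress_fasttext

model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'ft_cc.en.300_freqprune_50K_5K_pq_100.bin')
print(model.vector_size)      # dimensionality of the vectors it returns
print(model['hello'].shape)   # shape of a single word vector
```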
> 2- Can you help me reproduce the `ft_cc.en.300_freqprune_50K_5K_pq_100.bin` model? What are the steps to compress? compress-fasttext is not working for me.

I produced it with compress-fasttext. If it is not working for you, please describe exactly how to reproduce your problem, and I will fix it.
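The likely shape of that run, sketched (the sizes and `qdim` below are guesses decoded from the file name "freqprune_50K_5K_pq_100"; check them against the actual `prune_ft_freq` signature in your compress-fasttext version):

```python
import gensim
import compress_fasttext

# Start from the original 300-dim model...
big = gensim.models.fasttext.load_facebook_model('cc.en.300.bin').wv
# ...prune to ~50K words and ~5K ngram buckets, then product-quantize.
small = compress_fasttext.prune_ft_freq(
    big, new_vocab_size=50_000, new_ngrams_size=5_000, pq=True, qdim=100)
small.save('ft_cc.en.300_freqprune_50K_5K_pq_100.bin')
```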
> I do not wish to mess with the function, but I want to know which hash function you are using. Can you provide a pointer to its code?

The function is called `ft_ngram_hashes`, and here I import it from gensim (I try two different paths, because in different versions of gensim the hash function is located in different places).
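The two-path import looks roughly like this (the exact module paths depend on the gensim version, so treat them as an illustration), and underneath it all is fastText's FNV-1a variant, which can be reimplemented in pure Python along these lines:

```python
try:
    from gensim.models.utils_any2vec import ft_ngram_hashes  # older gensim
except ImportError:
    from gensim.models.fasttext import ft_ngram_hashes       # newer gensim

def ft_hash(s: str) -> int:
    """FNV-1a variant used by fastText to hash character ngrams.

    Note the int8 cast in the original C++ code: bytes >= 128 are
    sign-extended before being XORed into the hash.
    """
    h = 2166136261
    for b in s.encode('utf-8'):
        if b >= 128:
            b -= 256                          # emulate the C++ int8_t cast
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def ngram_hashes(word, minn, maxn, num_buckets):
    """Bucket indices of all character ngrams of '<word>'."""
    extended = '<' + word + '>'
    ngrams = [extended[i:i + n]
              for n in range(minn, maxn + 1)
              for i in range(len(extended) - n + 1)]
    return [ft_hash(g) % num_buckets for g in ngrams]
```

`minn`, `maxn`, and `num_buckets` should be read off the model itself (`model.min_n`, `model.max_n`, `model.bucket`) rather than hard-coded.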
I am using this pre-trained model: `ft_cc.en.300_freqprune_50K_5K_pq_100.bin`
That's my code:
I get this error:

```
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.
```
Is there a way to convert the vectors and ngrams back to gensim format to do this compression?
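What I imagine the conversion would look like (a rough sketch: the constructor and attribute names are from gensim 4.x, `n_components` is my guess at the `svd_ft` parameter name, and whether this round-trip actually works is exactly what I am unsure about):

```python
import numpy as np
import gensim
import compress_fasttext

small = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'ft_cc.en.300_freqprune_50K_5K_pq_100.bin')

# Rebuild a plain (uncompressed) gensim FastTextKeyedVectors
# from the unpacked vocabulary and vector arrays.
kv = gensim.models.fasttext.FastTextKeyedVectors(
    vector_size=small.vector_size, min_n=small.min_n,
    max_n=small.max_n, bucket=small.bucket)
kv.key_to_index = dict(small.key_to_index)
kv.index_to_key = list(small.index_to_key)
kv.vectors = np.asarray(small.vectors.unpack(), dtype=np.float32)
kv.vectors_ngrams = np.asarray(small.vectors_ngrams.unpack(), dtype=np.float32)

# Then reduce dimensionality with SVD and re-compress.
smaller = compress_fasttext.svd_ft(kv, n_components=100)
```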