avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

similar_by_word/similar_by_key broken after compression #13

Closed · SKRYMr closed this issue 2 years ago

SKRYMr commented 2 years ago

Hello, I am exploring the use of your library to compress some very big unsupervised models; however, while testing the full functionality I encountered this error:

>>> small_model.similar_by_key("ciao")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/skrymr/anaconda3/envs/semantic39/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 889, in similar_by_key
    return self.most_similar(positive=[key], topn=topn, restrict_vocab=restrict_vocab)
  File "/home/skrymr/anaconda3/envs/semantic39/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 842, in most_similar
    mean = self.get_mean_vector(keys, weight, pre_normalize=True, post_normalize=True, ignore_missing=False)
  File "/home/skrymr/anaconda3/envs/semantic39/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 507, in get_mean_vector
    mean = np.zeros(self.vector_size, self.vectors.dtype)
AttributeError: 'PQ' object has no attribute 'dtype'

I compressed the original FastText model with prune_ft_freq() and default parameters, except for new_vocab_length and new_ngrams_length, both of which I kept at their original values to avoid losing information.

avidale commented 2 years ago

Hi @SKRYMr, thank you for the alert! This method worked with gensim==4.1.2, but with gensim==4.2.0 it broke for some reason.

While I am investigating the issue, please downgrade gensim to version 4.1.2, and similarity search will work.

avidale commented 2 years ago

@SKRYMr I have added the missing properties and released version compress-fasttext==0.1.3. So now you can either downgrade to gensim==4.1.2 or upgrade to compress-fasttext==0.1.3, and both versions should work OK.
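A quick way to confirm which versions are installed and that similarity search works again (the model path is a placeholder; the loading call follows the pattern from the project README, so double-check it against your installed version):

import importlib.metadata as md
import compress_fasttext

print('gensim:', md.version('gensim'))
print('compress-fasttext:', md.version('compress-fasttext'))

# Placeholder path to a previously saved compressed model.
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('small_model')
print(small_model.similar_by_key('ciao', topn=5))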

SKRYMr commented 2 years ago

Wow that was quick! Thanks a lot, great work.

SKRYMr commented 2 years ago

Hi @avidale, I'll take advantage of this open issue to ask you a couple of questions instead of opening a new one. I was experimenting a bit with the compressed model and comparing its performance to the original. As I said in my first post, these are the parameters that I used to compress the model:

new_vocab_length = len(model.wv.key_to_index), so as to keep all original words;
new_ngrams_length = 2000000, so as to keep all ngram buckets in the original model.

The rest were kept at their default values (roughly as in the sketch below).
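For reference, a minimal sketch of a compression call along these lines (the paths are placeholders, and the keyword names follow the description above; double-check them against the prune_ft_freq signature in your installed compress-fasttext version):

from gensim.models import FastText
import compress_fasttext

# 'big_model.gensim' and 'small_model' are hypothetical paths.
model = FastText.load('big_model.gensim')          # the original gensim-format model
small_model = compress_fasttext.prune_ft_freq(
    model.wv,
    new_vocab_length=len(model.wv.key_to_index),   # keep every in-vocabulary word
    new_ngrams_length=2_000_000,                   # keep all 2M ngram buckets
)
small_model.save('small_model')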

The original model contains 1.4M words and 2M ngrams buckets, weighs 6.1GB on disk and about the same in memory. The compressed model on the other hand weighs 283MB on disk and around 600MB in memory the first time it's loaded.

The experiment goes as follows for both models:

  1. Load model
  2. Perform a similar_by_word() query and monitor memory usage and time elapsed (see the timing sketch after this list)
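A rough sketch of how the timing part of this experiment might look (the model path and query word are placeholders, and memory usage was monitored externally):

import time
import compress_fasttext

small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('small_model')

for i in range(5):
    start = time.perf_counter()
    small_model.similar_by_word('ciao', topn=10)
    print(f'query {i}: {time.perf_counter() - start:.2f} s')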

While the compressed model loads noticeably faster than the original model, it incurs a heavy cost on the first query, which can take up to 50-60 seconds as opposed to approximately 0.8s for the original model. Subsequent queries do much better, averaging around 2 seconds, but are still far slower than the original model. Furthermore, I see a spike in memory usage of around 3GB every time the compressed model performs the query; it lasts only a split second, but that is enough to cause the worker to die in my testing environment.

Of course, I expect a difference in execution time and memory usage on the first call, since the compressed model has to expand and recalculate the matrix from the quantized products. What I don't get is why the subsequent calls are still slower than the original model, and whether there is a way to improve the timings. I was looking through the code to see if I could spot the exact moment when the decompression happens, but I couldn't find it, so I would love some pointers to better understand the compression and decompression algorithms if you can spare the time!

We value execution time more than memory usage at the moment, but it would still be great if the model could be stored in its compressed form on disk (to improve transfer times) and restored to nearly full form when loaded, so as to achieve the same speed as the original model.

Thank you for your time and effort!

avidale commented 2 years ago

Hi @SKRYMr,

If you shared more detailed information about your environment (e.g. the package versions of gensim and compress-fasttext, and links to the original and the compressed models), you would help me a lot with the debugging. Please consider doing this the next time you raise an issue: the easier it is to reproduce a problem, the better help you are going to get.

Still, one can get to the root of your problem by (1) reading the code of the similar_by_word method that causes it, and (2) thinking about the nature of FastText models and their compression.

To inspect the code, I just click on the similar_by_word method in my IDE (it is PyCharm) and read the definition:

[screenshot: the similar_by_word definition in gensim/models/keyedvectors.py]

The method turns out to be defined in gensim.models.keyedvectors.KeyedVectors, and its implementation refers to the similar_by_key method of the same class, which, in turn, refers to the most_similar method. Here is its full code:

[screenshot: the full code of KeyedVectors.most_similar in gensim/models/keyedvectors.py]

The code has two suspicious lines: self.fill_norms() (this method computes and stores the norms of every vector in the model's vocabulary), and something with dot(self.vectors[clip_start:clip_end], mean). Both of these lines actually compute something with the full self.vectors matrix, which contains a vector for every word in the model's vocabulary!
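In essence, most_similar boils down to something like the following (a simplified re-implementation for illustration, not gensim's actual code): it touches every row of the vocabulary matrix.

import numpy as np

def naive_most_similar(vectors, norms, query_vec, topn=10):
    # One dot product per vocabulary word: O(vocab_size * dim) work,
    # and it needs the *full* vectors matrix materialized in memory.
    query_vec = query_vec / np.linalg.norm(query_vec)
    dists = (vectors @ query_vec) / norms
    best = np.argsort(-dists)[:topn]
    return best, dists[best]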

Why is it bad for performance? Because it is not how compress-FastText was intended to be used!

FastText models (without compression) basically consist of two large matrices: vectors and vectors_ngrams. For each word within the model's vocabulary, the embedding is just taken out of vectors, which is SUPER fast. For each out-of-vocabulary word, it is computed using vectors_ngrams, which is also very fast, because it involves just averaging a few rows selected from this matrix. This is why FastText is called FastText: its embeddings are fast to compute. The only problem is that the matrices vectors and vectors_ngrams are too large.
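A toy illustration of this lookup logic (this is not gensim's or fastText's actual code; the n-gram hashing in particular is simplified):

import numpy as np

def toy_fasttext_embedding(word, vocab, vectors, vectors_ngrams, minn=3, maxn=6):
    if word in vocab:
        return vectors[vocab[word]]              # in-vocabulary: a single row lookup
    # Out-of-vocabulary: hash the character n-grams of '<word>' into buckets
    # and average the corresponding rows of vectors_ngrams.
    token = f'<{word}>'
    buckets = [hash(token[i:i + n]) % len(vectors_ngrams)
               for n in range(minn, maxn + 1)
               for i in range(len(token) - n + 1)]
    if not buckets:
        return np.zeros(vectors.shape[1], dtype=vectors.dtype)
    return vectors_ngrams[buckets].mean(axis=0)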

My package compress-FastText overcomes the memory issue by trading some memory for computation. Now, instead of the full vectors and vectors_ngrams matrices, we store their minified versions: some rows of these matrices are dropped (pruning), and the rest of the rows are packed into a smaller format using some math formula (product quantization). This leads to the compressed model using much less memory than the original one. And now, to get the embedding of a word, we retrieve its compressed row from the compressed matrix and decompress it into a full embedding, which requires some computation and thus time. But because we do it for only one word at a time, this time is small.
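To make product quantization concrete, here is a toy version of the idea (the real library uses codebooks trained on the data and a more careful layout; this is only an illustration):

import numpy as np

dim, n_chunks, n_centroids = 300, 30, 256
chunk = dim // n_chunks
# In the real setup the codebooks are learned from the vectors; random here.
codebooks = np.random.rand(n_chunks, n_centroids, chunk).astype(np.float32)

def pq_compress(vec):
    # Replace each 10-dimensional chunk by the index of its nearest centroid:
    # 300 float32 values (1200 bytes) shrink to 30 uint8 codes (30 bytes).
    codes = np.empty(n_chunks, dtype=np.uint8)
    for i in range(n_chunks):
        sub = vec[i * chunk:(i + 1) * chunk]
        codes[i] = np.argmin(((codebooks[i] - sub) ** 2).sum(axis=1))
    return codes

def pq_decompress(codes):
    # Rebuilding a single row is just n_chunks table lookups plus a concatenate.
    return np.concatenate([codebooks[i][codes[i]] for i in range(n_chunks)])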

With this in mind, let's think about what the most_similar method actually does. With an uncompressed model, it:

  1. Takes the model.vectors matrix (which is just an ordinary numpy matrix kept in memory);
  2. multiplies it with a vector and computes top n values of the product.

But when you call the most_similar method with a compressed model, it:

  1. Takes the small packed matrix of the model's vocabulary (which is an instance of PQ class used for compression);
  2. Unpacks this matrix into the full vectors matrix!!! This occupies a lot of memory (because the vectors matrix is large; this is why we wanted to compress the model in the first place) and consumes a lot of computation time (because we need to perform computations to unpack a vector for each word in the vocabulary); see the sketch after this list.
  3. multiplies the unpacked matrix with a vector and computes top n values of the product.
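Schematically, step 2 amounts to something like this (toy code in the spirit of the product-quantization sketch above, not the library's internals):

import numpy as np

def decompress_all(pq_codes, decompress_row):
    # pq_codes: (vocab_size, n_chunks) array of uint8 codes, one row per word.
    # For a 1.4M-word vocabulary this means 1.4M decompressions and a large
    # temporary float32 matrix, before the dot product even starts.
    return np.stack([decompress_row(codes) for codes in pq_codes])

# full_vectors = decompress_all(codes_matrix, pq_decompress)   # the expensive part
# dists = full_vectors @ query_vec                             # then the usual search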

To sum up, when you use the most_similar method with a compressed model, you have to decompress it almost into a full model on each call (though without the full vectors_ngrams matrix), which consumes memory and time and leaves you worse off than with just an uncompressed model. And this is inevitable because of how the compression works.

What can I suggest you do if you want a small and fast model that computes most similar words? There are several options:

Which of these three options you choose depends on what you value more: low memory, fast speed, the ability to find neighbours for OOV words, or a large vocabulary from which these neighbours are chosen. Sorry, but you cannot have all four at the same time. This is kind of impossible.