avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

broken features after compression? #2

Closed. eps696 closed this issue 4 years ago

eps696 commented 4 years ago

Is it planned behaviour that some model features stop working after such compression? E.g.:
- model.most_similar raises "IndexError: list index out of range"
- model.doesnt_match always outputs the first element of the input list

avidale commented 4 years ago

Yes, we didn't test for all the features after compression. For some of them, compression interferes with Gensim code in an unpleasant way.

But if you do use these features on a regular basis, we could create a subclass of FastTextKeyedVectors that does deal with them.

@eps696 If you provide the full code of your errors (with which model files and on what inputs they reproduce), it would be a great help.

eps696 commented 4 years ago

i've tested all 4 mentioned compression methods on all fasttext models from RusVectores (as well as your compressed model), got absolutely identical issues.

example code:

import gensim

# Load the compressed model and keep a reference to it
model = gensim.models.fasttext.FastTextKeyedVectors.load('model/ft_freqprune_100K_20K_pq_100.bin')
# Russian input: "hello bye why farewell"
print('doesnt_match: привет пока зачем прощай ::',
      model.doesnt_match("привет пока зачем прощай".split()))
print('doesnt_match: left right back orange ::',
      model.doesnt_match("left right back orange".split()))
# Russian input: "thanks"
print(model.most_similar("спасибо"))

example result:

doesnt_match: привет пока зачем прощай :: привет
doesnt_match: left right back orange :: left

Traceback (most recent call last):
  File "F:\_neuro\_a\sequent\fastext\test-.py", line 21, in <module>
    print(model.most_similar("спасибо"))
  File "C:\Users\eps\AppData\Local\Programs\Python\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 569, in most_similar
    result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
  File "C:\Users\eps\AppData\Local\Programs\Python\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 569, in <listcomp>
    result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
IndexError: list index out of range

correct output of doesnt_match with a full model:

doesnt_match: привет пока зачем прощай :: зачем
doesnt_match: left right back orange :: orange

Btw, if such problems only happen with Gensim, could you advise some other way to work with compressed models? (I doubt it would fix the doesnt_match issue, though.)

avidale commented 4 years ago

After some investigation:

  1. The problem with model.doesnt_match was due to the incorrect way Gensim handles FastTextKeyedVectors.word_vec with use_norm=True. I have already fixed it in a pull request to Gensim, but the fix hasn't been released yet.
  2. The problem with most_similar is that the Gensim models have an attribute index2word, used only in this method. This attribute is initialized in a non-obvious way, and I forgot about it.
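
The second point can be illustrated with a toy stand-in for a keyed-vectors object. This is only a sketch under assumptions: `ToyKeyedVectors`, its `vocab` mapping (word to row index), and `rebuild_index2word` are hypothetical names, not Gensim or compress-fasttext code. It shows how an empty `index2word` produces the same kind of `IndexError` in a `most_similar`-style lookup, and how rebuilding the list from the vocabulary avoids it.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyKeyedVectors:
    """Hypothetical, simplified stand-in for a keyed-vectors object."""

    def __init__(self, vocab, vectors, index2word=None):
        self.vocab = vocab        # word -> row index (assumed layout)
        self.vectors = vectors    # list of rows, one per word
        self.index2word = index2word if index2word is not None else []

    def rebuild_index2word(self):
        # Order words by their row index so that index2word[i] names row i.
        self.index2word = sorted(self.vocab, key=self.vocab.get)

    def most_similar(self, word, topn=2):
        own = self.vocab[word]
        sims = [cosine(self.vectors[own], row) for row in self.vectors]
        # Rank row indices by similarity, best first.
        best = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        # Same shape as the failing line in the traceback: index -> word.
        result = [(self.index2word[i], sims[i]) for i in best if i != own]
        return result[:topn]


vocab = {"hello": 0, "hi": 1, "orange": 2}
vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]

# index2word is left empty here to simulate the uninitialized attribute.
kv = ToyKeyedVectors(vocab, vectors)
try:
    kv.most_similar("hello")
    failed = False
except IndexError:
    failed = True  # "list index out of range"

kv.rebuild_index2word()
neighbours = kv.most_similar("hello")
```

In this toy setup the lookup fails only because the index-to-word list is out of sync with the vector rows; the vectors themselves are fine.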

I fixed these two problems by creating the class compress_fasttext.models.CompressedFastTextKeyedVectors, which inherits from gensim.models.keyedvectors.FastTextKeyedVectors.

@eps696 Please try it and tell me whether you find any new issues with it.

eps696 commented 4 years ago

@avidale perfect, thank you! (funny that most_similar shows better results now with compressed model than with original)

avidale commented 4 years ago

> (funny that most_similar shows better results now with compressed model than with original)

It is fully expected: currently most_similar in Gensim FastText returns incorrect results because it applies vector normalization in the wrong place.
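
Why the placement of normalization matters for ranking can be seen in a toy example. This sketch is not the Gensim code path, and the vectors and word labels are made up for illustration. It compares ranking by raw dot product, which favours vectors with a large norm, with ranking by cosine similarity, where both vectors are scaled to unit length before the dot product.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up 2-d vectors, chosen only to make the effect visible.
vectors = {
    "thanks": [1.0, 0.0],
    "thank_you": [0.9, 0.1],  # nearly the same direction as the query
    "table": [3.0, 3.0],      # different direction, but a large norm
}

query = "thanks"
others = [w for w in vectors if w != query]

# Ranking without normalization: the long vector wins.
by_raw_dot = max(others, key=lambda w: dot(vectors[query], vectors[w]))

# Ranking with both sides normalized: the closest direction wins.
by_cosine = max(others, key=lambda w: dot(unit(vectors[query]), unit(vectors[w])))
```

Here `by_raw_dot` picks the long but unrelated vector, while `by_cosine` picks the vector pointing in nearly the same direction as the query.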