Closed eps696 closed 4 years ago
Yes, we didn't test for all the features after compression. For some of them, compression interferes with Gensim code in an unpleasant way.
But if you do use these features on a regular basis, we could create a subclass of FastTextKeyedVectors that does deal with them.
@eps696 If you provide the full code of your errors (with which model files and on what inputs they reproduce), it would be a great help.
i've tested all 4 mentioned compression methods on all fasttext models from RusVectores (as well as your compressed model), got absolutely identical issues.
example code:
import gensim
model = gensim.models.fasttext.FastTextKeyedVectors.load('model/ft_freqprune_100K_20K_pq_100.bin')
print('doesnt_match: привет пока зачем прощай ::',
model.doesnt_match("привет пока зачем прощай".split()))
print('doesnt_match: left right back orange ::',
model.doesnt_match("left right back orange".split()))
print(model.most_similar("спасибо"))
example result:
doesnt_match: привет пока зачем прощай :: привет
doesnt_match: left right back orange :: left
Traceback (most recent call last):
File "F:\_neuro\_a\sequent\fastext\test-.py", line 21, in <module>
print(model.most_similar("спасибо"))
File "C:\Users\eps\AppData\Local\Programs\Python\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 569, in most_similar
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
File "C:\Users\eps\AppData\Local\Programs\Python\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 569, in <listcomp>
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
IndexError: list index out of range
correct output of doesnt_match with a full model:
doesnt_match: привет пока зачем прощай :: зачем
doesnt_match: left right back orange :: orange
btw, if such problems only happen with gensim, could you advise some other way to work with compressed models? (i doubt it would fix the doesnt_match issue though)
After some investigation:
The problem with doesnt_match comes from FastTextKeyedVectors.word_vec with use_norm=True. I already fixed it in a pull-request to Gensim, but it hasn't been released yet.
The problem with most_similar is that the Gensim models have an attribute index2word, used only in this method. This attribute is initialized in a non-obvious way, and I forgot about it.
I fixed these two problems by creating the class compress_fasttext.models.CompressedFastTextKeyedVectors, inheriting from gensim.models.keyedvectors.FastTextKeyedVectors.
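To make the index2word failure concrete, here is a toy sketch of the lookup from the traceback. It is not Gensim's code: the function top_words and all the data are made up for illustration, and it assumes the failure mode is simply that index2word has fewer entries than the vector table has rows.

```python
# Toy stand-in for the lookup in the traceback; this is NOT Gensim's code.
# Assumption: after compression the vector table has more rows than
# index2word has entries, so some row ids have no matching word.

def top_words(index2word, scores, topn=3):
    """Return the topn (word, score) pairs, mimicking the failing list comprehension."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:topn]
    return [(index2word[i], float(scores[i])) for i in best]

scores = [0.1, 0.9, 0.4, 0.8]  # one similarity score per vector row

stale_index2word = ["a", "b"]  # out of sync: 2 words for 4 rows
try:
    top_words(stale_index2word, scores)
    failed = False
except IndexError:
    failed = True  # same error class as in the traceback

synced_index2word = ["a", "b", "c", "d"]  # rebuilt to match the rows
result = top_words(synced_index2word, scores)
```

With the stale list the lookup hits a row id that has no word and raises IndexError; once the list matches the rows, the same lookup succeeds.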
@eps696 Please try it and tell whether you find any new issues with it.
@avidale perfect, thank you!
(funny that most_similar shows better results now with compressed model than with original)
It is fully expected: currently most_similar in Gensim FastText returns incorrect results because it applies vector normalization in the wrong place.
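The thread does not say where exactly the normalization is misplaced, so the following is only a generic sketch of why placement can matter. It assumes a word vector is built as the average of several subword vectors (the vectors here are hypothetical) and compares normalizing the final average with normalizing each part before averaging.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Hypothetical subword vectors with very different lengths.
parts = [[10.0, 0.0], [0.0, 1.0]]

norm_after = normalize(mean(parts))                # average, then normalize
norm_before = mean([normalize(p) for p in parts])  # normalize each part, then average

query = [1.0, 0.0]
sim_after = cosine(query, norm_after)
sim_before = cosine(query, norm_before)
```

In this toy case the two orders give clearly different similarities to the same query (about 0.995 versus about 0.707), so a ranking such as most_similar can change depending on where the normalization happens.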
is it planned behaviour that some model features stop working after such compression? e.g.:
model.most_similar raises "IndexError: list index out of range"
model.doesnt_match always outputs the first element from the input list