beviah closed this issue 2 years ago
Hi @beviah! With which model did you encounter this issue, and how was it compressed? Please attach the code that reproduces this behaviour (and indicate the versions of gensim and compress-fasttext), if it is possible.
If you used n-gram pruning, then indeed, the embeddings of all n-grams except the k most important ones (k = the new n-grams size) become zero. So if an OOV word contains no important n-grams, its embedding is going to be zero. This is expected behaviour. To avoid it, you should use a compression method without n-gram pruning.
However, if such behaviour has occurred in a model without n-gram pruning, then it's a bug, so please share the code that reproduces it so that I can fix it.
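The pruning effect described above can be illustrated with a toy sketch. It is not the compress-fasttext implementation: the helper names `char_ngrams` and `oov_vector` are hypothetical, the n-gram table is a plain dict rather than fastText's hashed buckets, and "pruning" is simulated by keeping only the n-grams of one word.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of '<word>' with fastText-style boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    """Average the n-gram vectors of a word; n-grams missing from the
    table (i.e. pruned ones) contribute a zero vector."""
    grams = char_ngrams(word)
    total = np.zeros(dim, dtype=np.float32)
    for gram in grams:
        total += ngram_vectors.get(gram, np.zeros(dim, dtype=np.float32))
    return total / max(len(grams), 1)

dim = 4
rng = np.random.default_rng(0)

# Assumption for illustration: after pruning, only the n-grams of "python"
# survive as the "important" ones; everything else is treated as zero.
kept = {g: rng.normal(size=dim).astype(np.float32) for g in char_ngrams("python")}

vec_similar = oov_vector("pythom", kept, dim)    # shares n-grams such as "<py", "pyt"
vec_unrelated = oov_vector("zzzzz", kept, dim)   # shares no kept n-grams

print("pythom has a non-zero vector:", bool(np.any(vec_similar != 0)))
print("zzzzz has a non-zero vector:", bool(np.any(vec_unrelated != 0)))
```

In this toy setting, an OOV word that overlaps with the kept n-grams still gets a non-zero vector, while a word with no overlap collapses to all zeros.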
I will attach the code soon.
But do note that n-gram pruning is irrelevant here: the compressed model works properly, then I save it and load it, and after loading it no longer works properly!
Name: gensim
Version: 4.0.1
Name: compress-fasttext
Version: 0.1.3
@avidale Update: the issue is actually here; I was wrong in the initial description. The model is saved and loaded successfully, but after unpacking the vectors, the results are not good.
import numpy as np
import compress_fasttext

ft = compress_fasttext.CompressedFastTextKeyedVectors.load(path_small)
ft.get_vector('pythom')  # correction: it still works well here, so loading works; the issue is in the lines below
ft.vectors = ft.vectors.unpack()
ft.vectors = np.float32(ft.vectors)
ft.vectors_ngrams = ft.vectors_ngrams.unpack()
ft.vectors_ngrams = np.float32(ft.vectors_ngrams)
ft.get_vector('pythom')  # does not work well; should give the same result as above
It does not work well after unpacking the vectors, although it works well before. Shouldn't unpacking be irrelevant for vector retrieval? Shouldn't the vectors be the same before and after unpacking? I need the vectors in uncompressed form for some further processing, so not as a RowSparseMatrix.
Never mind, I figured it out: I should not do PQ at this stage, but later, once I have the final vectors.
For reference: https://github.com/avidale/compress-fasttext/issues/8
Even with compressed loading, OOV vectors are all zeroes, while the model that was saved returned actual vectors.
Simply test the fastText model on
ft.get_vector('pythom')
before saving, and again after compressing and saving/loading.
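The before/after check suggested above could be made repeatable with a small comparison helper. This is only a sketch: `snapshot` and `changed_words` are hypothetical helpers, and the two dicts below are toy stand-ins for a model before and after the round trip, not real compress-fasttext output.

```python
import numpy as np

def snapshot(get_vector, words):
    """Record a copy of the vector returned for each probe word."""
    return {w: np.array(get_vector(w), dtype=np.float32) for w in words}

def changed_words(before, after, atol=1e-5):
    """Return the probe words whose vectors differ between two snapshots."""
    return [w for w in before if not np.allclose(before[w], after[w], atol=atol)]

# Toy stand-ins (assumed values): the OOV word "pythom" collapses to zeros
# after the round trip, while the in-vocabulary word is unchanged.
toy_before = {"python": np.array([0.5, 0.1, 0.0]), "pythom": np.array([0.1, 0.2, 0.3])}
toy_after = {"python": np.array([0.5, 0.1, 0.0]), "pythom": np.zeros(3)}

probes = ["python", "pythom"]
before = snapshot(toy_before.__getitem__, probes)
after = snapshot(toy_after.__getitem__, probes)

print(changed_words(before, after))
```

With a real model, `ft.get_vector` would be passed in place of the dict lookups: take one snapshot before saving or unpacking and another afterwards, then inspect which probe words changed.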