avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License
165 stars 13 forks

After unpacking vectors, get_vector returns zeroes for OOV words #15

Closed beviah closed 2 years ago

beviah commented 2 years ago

to reference https://github.com/avidale/compress-fasttext/issues/8

even after compressing and loading, OOV vectors are all zeroes, while the model before saving returned actual vectors..

simply test the fastText model on

ft.get_vector('pythom')

before saving, and again after compressing, saving, and loading..

avidale commented 2 years ago

Hi @beviah ! With which model did you encounter this issue, and how was it compressed? Please attach the code that reproduces this behaviour (and indicate the versions of gensim and compress-fasttext), if possible.

If you used pruning for n-grams, then indeed, the embeddings of all n-grams except the k most important ones (where k is the new n-gram table size) become zero. So if an OOV word contains no important n-grams, its embedding is indeed going to be zero. This is expected behaviour. To avoid it, you should use a compression method without n-gram pruning.
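The effect described above can be sketched with plain numpy. This is a toy illustration of the idea, not the compress-fasttext internals: an OOV vector is modelled as the average of the word's n-gram embeddings, and pruning zeroes out every n-gram row except the k most "important" ones (importance here is just a random stand-in score).

```python
import numpy as np

rng = np.random.default_rng(0)
n_ngrams, dim, k = 10, 4, 3              # toy sizes; real models are much larger
ngram_table = rng.normal(size=(n_ngrams, dim)).astype(np.float32)

importance = rng.random(n_ngrams)        # stand-in for n-gram importance scores
kept = np.argsort(importance)[-k:]       # indices of the k most important n-grams
pruned_table = np.zeros_like(ngram_table)
pruned_table[kept] = ngram_table[kept]   # everything else becomes a zero row

def oov_vector(table, ngram_ids):
    """OOV embedding = average of the word's n-gram embeddings."""
    return table[ngram_ids].mean(axis=0)

# a word none of whose n-grams survived the pruning
unlucky = [i for i in range(n_ngrams) if i not in kept]

print(oov_vector(ngram_table, unlucky))   # non-zero before pruning
print(oov_vector(pruned_table, unlucky))  # all zeros after pruning
```

A word that shares at least one kept n-gram would still get a non-zero (if degraded) vector; only words built entirely from pruned n-grams collapse to zero.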

However, if such behaviour has occurred in a model without n-gram pruning, then it's a bug, so please share the code that reproduces it so that I can fix it.

beviah commented 2 years ago

I will attach the code soon..

but do note that n-gram pruning is irrelevant here: the compressed model works properly, then I save it, then load it, and it does not work properly after loading!

beviah commented 2 years ago

Name: gensim, Version: 4.0.1
Name: compress-fasttext, Version: 0.1.3

@avidale Update.. so the issue is actually here.. I was wrong in the initial description. The model is saved/loaded successfully, but after unpacking the vectors, the vectors are not good..

import numpy as np
import compress_fasttext

ft = compress_fasttext.CompressedFastTextKeyedVectors.load(path_small)
ft.get_vector('pythom')  # correction: it still works well here, so loading works! the issue is actually in the lines below

# unpack the compressed matrices into plain float32 arrays
ft.vectors = ft.vectors.unpack()
ft.vectors = np.float32(ft.vectors)
ft.vectors_ngrams = ft.vectors_ngrams.unpack()
ft.vectors_ngrams = np.float32(ft.vectors_ngrams)

ft.get_vector('pythom')  # does not work well; should give the same result as above

It does not work well after unpacking the vectors! Before doing that, it does work well. Shouldn't this be irrelevant for vector retrieval? Shouldn't the vectors be the same before and after unpacking? I need the vectors in uncompressed form for some further processing, so not as a RowSparseMatrix.
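A safer way to get plain float32 vectors for further processing is to copy them out through get_vector into a fresh array, leaving the model's packed matrices untouched so OOV lookups keep working. A minimal sketch of the pattern, using a hypothetical stand-in class in place of the real loaded CompressedFastTextKeyedVectors:

```python
import numpy as np

class DummyModel:
    """Stand-in for a loaded CompressedFastTextKeyedVectors (illustration only)."""
    def __init__(self):
        self._vecs = {'python': np.arange(4, dtype=np.float32)}
    def get_vector(self, word):
        return self._vecs.get(word, np.zeros(4, dtype=np.float32))

ft = DummyModel()
words = ['python']  # in the real case: the vocabulary words you need

# Export a dense float32 matrix instead of mutating ft.vectors /
# ft.vectors_ngrams in place; the compressed model stays intact.
exported = np.stack([np.asarray(ft.get_vector(w), dtype=np.float32) for w in words])
print(exported.shape)  # one row per word
```

With the real model you would iterate over its vocabulary the same way; the key point is that the export is a copy, not an in-place unpack.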

beviah commented 2 years ago

never mind.. I figured it out.. I should not apply pq at this stage, but later, once I have the final vectors..