facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

.bin file and .vec file return different vectors. #660

Open enzolupia opened 5 years ago

enzolupia commented 5 years ago

Hi everyone, I'm currently using fastText to get word embeddings for given text data, but I noticed that the .vec and the .bin files output different vectors for the same words. Is there a particular reason why this happens? I checked both the English and Italian files and the problem is the same. If I use the .vec vectors the results are good: the vectors capture word similarity as required by the task I'm implementing. With the .bin file the results are really bad. I've also noticed that the .bin vector values are really small, e.g. 3.75941932e-01 and 2.66238451e-01 are two of the returned values.

PS: I'm using the Python wrapper.

Celebio commented 5 years ago

Hi @enzolupia. Thank you for reporting this issue.

I downloaded the following model: https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip and tried to reproduce with the python wrapper with the following code:

import fastText 
m = fastText.load_model("wiki.en/wiki.en.bin") 
v = m.get_word_vector('hello')

The vector I get from get_word_vector is the same as in the .vec file.

Could you please try the script bin_to_vec.py under python/doc/examples/ and check if it matches? Also, it would be great if you could provide more information to help us reproduce on our end.
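For anyone who wants to run this comparison by hand, here is a minimal sketch in plain Python. It assumes the usual text layout of a .vec file (a header line with the vocabulary size and the dimension, then one word per line followed by its values). The helper names `read_vec_entry` and `max_abs_diff` are made up for illustration, and the `wiki.en/wiki.en.vec` path in the comment is an assumption based on the archive name above.

```python
def read_vec_entry(path, word):
    """Return the vector stored for `word` in a text .vec file, or None if absent."""
    with open(path, encoding="utf-8", errors="replace") as f:
        header = f.readline().split()  # first line: "<vocab_size> <dim>"
        dim = int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            # The last `dim` fields are the values; everything before is the token.
            token = " ".join(parts[:-dim])
            if token == word:
                return [float(x) for x in parts[-dim:]]
    return None


def max_abs_diff(u, v):
    """Largest absolute per-component difference between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


# Usage sketch (not executed here; needs the fastText module and the model files):
#   m = fastText.load_model("wiki.en/wiki.en.bin")
#   from_bin = m.get_word_vector("hello")
#   from_vec = read_vec_entry("wiki.en/wiki.en.vec", "hello")  # assumed path
#   print(max_abs_diff(from_bin, from_vec))
```

Because the .vec file stores rounded decimal text, a small nonzero difference is expected even when the two files agree.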

wxjia commented 5 years ago

@Celebio The .bin file can be converted to a .vec file, so how can I convert a .vec file to a .bin file? I got a .vec file from another tool, but I want to get similar words with fastText because that tool uses too much memory.

arendu-zz commented 5 years ago

I seem to see the same issue as well.

I downloaded this file: https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M-subword.zip

This is my test result:

$>echo "hello" | ./fastText/fasttext print-word-vectors crawl-300d-2M-subword.bin
hello 0.01287 -0.022696 0.018979 -0.069096 -0.044552 -0.001429 0.041804 (truncated)
$> grep '^hello ' crawl-300d-2M-subword.vec
hello 0.0214 -0.0378 0.0316 -0.1152 -0.0743 -0.0024 (truncated)

arendu-zz commented 5 years ago

@Celebio bin_to_vec.py throws this error for me:

$>python fastText/python/doc/examples/bin_to_vec.py crawl-300d-2M-subword.bin
File "fastText/python/doc/examples/bin_to_vec.py", line 30, in <module>
    words = f.get_words()
  File "/home/arenduc1/anaconda3/envs/pytorch041env/lib/python3.6/site-packages/fastText/FastText.py", line 170, in get_words
    pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 57: unexpected end of data

I am downloading the file you suggested and will try the same.
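The message in the traceback is what Python reports when a byte string ends partway through a multi-byte UTF-8 character. A minimal sketch in plain Python reproduces it, using a made-up byte string (not the actual model vocabulary), and shows how a lenient decode avoids the exception. Whether the installed wrapper exposes such an option is not confirmed here.

```python
# Hypothetical token whose last character was cut in half:
# 0xCE opens a two-byte UTF-8 sequence, but the second byte is missing.
raw = b"hello\xce"

try:
    raw.decode("utf-8")  # strict decoding, the default
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

# A lenient decode replaces the incomplete sequence with U+FFFD instead of raising.
lenient = raw.decode("utf-8", errors="replace")

print(strict_error)
print(repr(lenient))
```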

ghazeefa commented 5 years ago

I am using the fastText pre-trained model for the Urdu language: https://fasttext.cc/docs/en/pretrained-vectors.html Why am I getting different vectors from the .bin and .vec files? Which one should I use to evaluate the model?

import gensim.models.keyedvectors as word2vec1
from scipy import spatial
from gensim.models import FastText

pathToBinVectors = 'C:/Users/admin/fasttextwiki/wiki.ur.vec'
embed_map = word2vec1.KeyedVectors.load_word2vec_format(pathToBinVectors)
gg = embed_map.wv.get_vector('سائیکل')
hh = embed_map.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)

I got a similarity score of 1.8220717906951904. When I load the .bin file:

model = FastText.load_fasttext_format('C:/Users/admin/fasttextwiki/wiki.ur.bin')
gg = model.wv.get_vector('سائیکل')
hh = model.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)

I got a similarity score of 0.376111775636673.

tbornt commented 5 years ago

Same issue. I tried bin_to_vec.py, and the output vectors differ from the original vectors in the .vec file. The vectors from the .bin file also perform badly in a text classification task.
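One way to quantify how the two rows posted earlier in this thread for crawl-300d-2M-subword differ is to compare their direction and their scale separately. The sketch below uses only the six components of "hello" that were pasted before truncation, so it says nothing about the remaining components or about the cause of the mismatch. The helper names are illustrative.

```python
import math

# First six components posted in this thread for "hello" (the rest was truncated).
bin_vec = [0.01287, -0.022696, 0.018979, -0.069096, -0.044552, -0.001429]  # print-word-vectors on .bin
vec_vec = [0.0214, -0.0378, 0.0316, -0.1152, -0.0743, -0.0024]             # line from the .vec file


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity: 1.0 means the same direction regardless of scale."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


ratios = [b / a for a, b in zip(bin_vec, vec_vec)]  # per-component .vec / .bin
similarity = cosine(bin_vec, vec_vec)
scale = norm(vec_vec) / norm(bin_vec)

print("per-component ratios:", [round(r, 3) for r in ratios])
print("cosine similarity:", similarity)
print("norm ratio:", scale)
```

On these six values the per-component ratios are all close to one another, which is what a difference in scale rather than in direction would look like. Running the same check on the full vectors would show whether that holds across all dimensions.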

Samriddhi-dubey10 commented 1 year ago

I am trying to download this file, https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip, but I am getting this error:

--2022-11-03 18:03:40-- https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.116.192
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.116.192|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2022-11-03 18:03:41 ERROR 403: Forbidden.