Open enzolupia opened 5 years ago
Hi @enzolupia. Thank you for reporting this issue.
I downloaded the following model: https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip and tried to reproduce with the python wrapper with the following code:
import fastText
m = fastText.load_model("wiki.en/wiki.en.bin")
v = m.get_word_vector('hello')
The vector I get from get_word_vector is the same as in the .vec file.
Could you please try the script bin_to_vec.py under python/doc/examples/ and check if it matches?
Also, it would be great if you could provide more information to help us reproduce on our end.
@Celebio A bin file can be converted to a vec file, but how can I convert a vec file to a bin file? I got a vec file from another tool, and I want to find similar words with fastText because that tool uses too much memory.
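The thread does not give a vec-to-bin route, so the following is only a rough sketch of a different approach: reading the .vec text file directly and ranking neighbours by cosine similarity. It assumes the usual .vec layout (a header line with the word count and dimension, then one word and its values per row). The names `load_vec`, `most_similar` and the `limit` parameter are hypothetical helpers, not part of fastText, and the toy data is made up.

```python
import heapq
import io
import math


def load_vec(handle, limit=None):
    """Read .vec text: a 'count dim' header, then 'word v1 ... vdim' rows."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break  # cap the number of rows read to bound memory use
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            continue  # skip malformed rows instead of guessing
        vectors[word] = values
    return vectors


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, query, k=3):
    """Return the k words closest to `query` by cosine similarity."""
    target = vectors[query]
    scored = ((cosine(target, vec), word)
              for word, vec in vectors.items() if word != query)
    return [word for _, word in heapq.nlargest(k, scored)]


# Made-up toy data standing in for a real .vec file.
sample = io.StringIO(
    "4 3\ncat 1.0 0.1 0.0\ndog 0.9 0.2 0.0\ncar 0.0 0.1 1.0\nbus 0.1 0.0 0.9\n"
)
toy_vectors = load_vec(sample)
nearest = most_similar(toy_vectors, "cat", k=1)
print(nearest)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8")`. A .vec file holds only whole-word vectors, so this sketch cannot produce vectors for words missing from the file.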
I seem to see the same issue as well.
I downloaded this file: https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M-subword.zip
This is my test result:
$>echo "hello" | ./fastText/fasttext print-word-vectors crawl-300d-2M-subword.bin
hello 0.01287 -0.022696 0.018979 -0.069096 -0.044552 -0.001429 0.041804 (truncated)
$> grep '^hello ' crawl-300d-2M-subword.vec
hello 0.0214 -0.0378 0.0316 -0.1152 -0.0743 -0.0024 (truncated)
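One quick check on the two rows quoted above is whether they differ in direction or only in scale. The sketch below uses just the six components visible in both truncated outputs, so it is indicative at best. On these values the per-component ratios all fall between roughly 1.66 and 1.68, which suggests the two vectors are close to parallel. It says nothing about why the scale differs.

```python
import math

# First six components of "hello" as quoted above (both outputs are truncated).
from_bin = [0.01287, -0.022696, 0.018979, -0.069096, -0.044552, -0.001429]
from_vec = [0.0214, -0.0378, 0.0316, -0.1152, -0.0743, -0.0024]

# Component-wise ratio: a near-constant value means the rows differ by a scalar.
ratios = [v / b for v, b in zip(from_vec, from_bin)]

# Cosine similarity ignores scale, so it isolates the difference in direction.
dot = sum(v * b for v, b in zip(from_vec, from_bin))
norm_bin = math.sqrt(sum(b * b for b in from_bin))
norm_vec = math.sqrt(sum(v * v for v in from_vec))
cosine = dot / (norm_bin * norm_vec)

print([round(r, 3) for r in ratios])
print(round(cosine, 6))
```

If the full 300-dimensional rows behave the same way, cosine-based similarity would be unaffected, while anything sensitive to vector magnitude could still change.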
@Celebio bin_to_vec.py throws this error for me:
$>python fastText/python/doc/examples/bin_to_vec.py crawl-300d-2M-subword.bin
File "fastText/python/doc/examples/bin_to_vec.py", line 30, in <module>
words = f.get_words()
File "/home/arenduc1/anaconda3/envs/pytorch041env/lib/python3.6/site-packages/fastText/FastText.py", line 170, in get_words
pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 57: unexpected end of data
I am downloading the file you suggested and will try the same.
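The "unexpected end of data" message in the traceback above is what Python reports when a byte string ends partway through a multi-byte UTF-8 sequence. The sketch below reproduces that error class with made-up bytes and shows the standard-library `errors="replace"` handler. It is an illustration only: the thread does not say whether this version of the wrapper exposes such an option.

```python
# Hypothetical bytes: "α" is 0xCE 0xB1 in UTF-8, so dropping the last byte
# leaves a lone 0xCE lead byte at the end of the buffer.
truncated = "hello α".encode("utf-8")[:-1]

try:
    truncated.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

# errors="replace" substitutes U+FFFD for the incomplete sequence instead of raising.
lenient = truncated.decode("utf-8", errors="replace")

print(strict_error)
print(repr(lenient))
```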
I am using the fastText pre-trained model for the Urdu language (https://fasttext.cc/docs/en/pretrained-vectors.html). Why am I getting different vectors from the .bin and .vec files? Which one should I use to evaluate the model?
import gensim.models.keyedvectors as word2vec1
from scipy import spatial
from gensim.models import FastText

pathToBinVectors = 'C:/Users/admin/fasttextwiki/wiki.ur.vec'
embed_map = word2vec1.KeyedVectors.load_word2vec_format(pathToBinVectors)
gg = embed_map.wv.get_vector('سائیکل')
hh = embed_map.wv.get_vector('گاڑی')
a = 1 - spatial.distance.cosine(gg, hh)
print(a*4)
I got a similarity score of 1.8220717906951904. When I load the .bin file:
model = FastText.load_fasttext_format('C:/Users/admin/fasttextwiki/wiki.ur.bin')
gg = model.wv.get_vector('سائیکل')
hh = model.wv.get_vector('گاڑی')
a = 1 - spatial.distance.cosine(gg, hh)
print(a*4)
I got a similarity score of 0.376111775636673.
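A note on reading the two scores above: cosine similarity lies in [-1, 1], so 1.82 cannot be a raw cosine. Both snippets end with `print(a*4)`, so the printed numbers are four times the actual similarity. Undoing that scaling, as a small worked example on the posted values:

```python
# Values printed in the post above, each produced by print(a*4).
printed_vec = 1.8220717906951904
printed_bin = 0.376111775636673

# Divide by 4 to recover the cosine similarities themselves.
cosine_vec = printed_vec / 4
cosine_bin = printed_bin / 4

print(cosine_vec, cosine_bin)
```

The underlying similarities are therefore about 0.456 from the .vec file and about 0.094 from the .bin file, which is still a substantial gap.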
Same issue. I tried bin_to_vec.py, and the output vectors differ from the original vectors in the .vec file.
And the vectors from .bin perform badly in a text classification task.
I am trying to download this file https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip but I am getting this error:
--2022-11-03 18:03:40-- https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.116.192
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.116.192|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2022-11-03 18:03:41 ERROR 403: Forbidden.
Hi everyone, I'm currently using FastText to get word embeddings for given text data, but I noticed that the .vec and .bin files output different vectors for the same words. Is there a particular reason why this happens? I checked both the English and Italian files and the problem is the same. With the .vec vectors the results are good: the vectors capture word similarity as required by the task I'm implementing. With the .bin file the results are really bad. I've also noticed that the .bin vector values are really small; for example, 3.75941932e-01 and 2.66238451e-01 are two of the returned values.
PS: I'm using the Python wrapper.