epfml / sent2vec

General purpose unsupervised sentence representations

Using fastText to compute the sentence vectors #47

Closed: ninikolov closed this issue 4 years ago

ninikolov commented 5 years ago

Hi,

Am I correct in thinking there is no difference between using sent2vec's Python interface and loading the sent2vec model directly into fastText to access the word vectors? For example, using your Wikipedia unigram model I get:

>>> sent2vec_model.embed_sentence("hello world")

array([-8.04620683e-01,  8.93464625e-01, -3.09190452e-02, -1.37415811e-01,
       -3.63196850e-01,  2.66210616e-01, -7.63243794e-01,  8.93934608e-01,

and with fastText:

>>> fasttext_model = fastText.load_model("wiki_unigrams.bin")
>>> (fasttext_model.get_word_vector("hello") + fasttext_model.get_word_vector("world"))/2

array([-8.04620683e-01,  8.93464625e-01, -3.09190452e-02, -1.37415811e-01,
       -3.63196850e-01,  2.66210616e-01, -7.63243794e-01,  8.93934608e-01,

Thanks for the info.

martinjaggi commented 5 years ago

This is true if you only care about unigrams (word vectors), and only if you load our provided models rather than the pretrained fastText models. The sent2vec models, however, also add bigram and trigram vectors, which you would not get with this approach.
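To make the point above concrete, here is a minimal sketch with hand-made toy vectors. It assumes the sentence vector is a plain average over every n-gram feature found in the model; the `toy_vectors` table and the `embed` and `average` helpers are hypothetical and are not part of sent2vec or fastText, and the real libraries store n-grams differently.

```python
# Toy illustration only: hand-made 2-D vectors, not values from any real model.
toy_vectors = {
    "hello": [1.0, 0.0],
    "world": [0.0, 1.0],
    "hello world": [4.0, 4.0],  # hypothetical bigram vector
}


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed(tokens, table, max_n=1):
    """Average the vectors of all n-grams (n <= max_n) present in the table."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            key = " ".join(tokens[i:i + n])
            if key in table:
                features.append(table[key])
    return average(features)


tokens = "hello world".split()
unigram_only = embed(tokens, toy_vectors, max_n=1)
with_bigrams = embed(tokens, toy_vectors, max_n=2)
manual_mean = average([toy_vectors["hello"], toy_vectors["world"]])

print(unigram_only == manual_mean)  # unigram case matches averaging by hand
print(with_bigrams == manual_mean)  # the bigram vector shifts the average
```

With unigrams only, averaging the word vectors by hand reproduces the sentence vector. Once a bigram vector takes part in the average, the two results no longer agree.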

welcomebyvenkat commented 5 years ago

Hi,

Why is it taking more than 2 hours to get the result of embed_sentence (a 700-dimensional vector)? Or do I need to do anything special to make it faster?

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")  # takes more than 2 hours to return the sentence vector

Thanks for your time & info

mpagli commented 5 years ago

Which OS are you using?

welcomebyvenkat commented 5 years ago

> Which OS are you using?

It's RHEL 7 with a GPU installed (120+ processors). I'm using this machine just to have 20+ GB of RAM available while loading the .bin file (my .bin is 21 GB).

mpagli commented 5 years ago

embed_sentence shouldn't take two hours. Maybe try to use embed_sentences instead of embed_sentence. Could it be your gcc version?
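One way to narrow this down is to time the model load and the embedding call separately, and to pass all sentences in a single embed_sentences call. The sketch below is only an illustration: `timed` and `StandInModel` are hypothetical helpers so that it runs without sent2vec installed, and the stand-in returns dummy vectors.

```python
import time


def timed(label, fn, *args):
    """Call fn(*args), print the elapsed wall-clock time, and return the result."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result


class StandInModel:
    """Hypothetical stand-in so the sketch runs without sent2vec installed."""

    def load_model(self, path):
        # A real model would read the binary file here.
        self.loaded = path

    def embed_sentences(self, sentences):
        # Dummy 3-D vectors; a real model returns one embedding per sentence.
        return [[0.0] * 3 for _ in sentences]


model = StandInModel()  # with sent2vec installed: sent2vec.Sent2vecModel()
timed("load_model", model.load_model, "model.bin")

sentences = ["once upon a time .", "hello world"]
embeddings = timed("embed_sentences", model.embed_sentences, sentences)
```

Replacing `StandInModel()` with `sent2vec.Sent2vecModel()` would show whether the time goes into loading the 21 GB file or into the embedding call itself.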