epfml / sent2vec

General purpose unsupervised sentence representations

Learned Embeddings #30

Closed EliHei closed 6 years ago

EliHei commented 6 years ago

Hi,

How can I extract the embedding vectors after using ./fasttext sent2vec... on a training set of sentences?

mpagli commented 6 years ago

Hi @EliHei,

By embedding vectors, do you mean n-gram embedding vectors? You can extract the unigram embeddings by asking for the sentence embedding of each word in your vocabulary. For bigram embeddings it is a bit trickier: each bigram is mapped to an embedding through a hashing function, and there is no functionality to pass in a list of n-grams and get the associated embeddings back.
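
For instance, with the Python wrapper shipped in this repo (the model path below is a placeholder), you can embed every vocabulary word as a one-word sentence:

```python
import sent2vec

# Load a trained sent2vec model; 'model.bin' is a placeholder path.
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')

# A single token produces no bigrams, so the "sentence" embedding of a
# one-word sentence is exactly that word's unigram embedding.
vocab = ['i', 'like', 'apples']              # replace with your own word list
unigram_embs = model.embed_sentences(vocab)  # numpy array, shape (len(vocab), dim)
```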

I hope this answers your question :)

EliHei commented 6 years ago

Got it! Thanks.

patrickfrank1 commented 5 years ago

Hi,

Does that mean it is impossible to manually compose a sentence embedding by averaging the embeddings of all the n-grams it is composed of?

I tried it here, but it does not seem to work: https://colab.research.google.com/drive/1NAxHZShWSj9X4YYXSfkpCOrE1C6hOIMe

Best, Patrick

mpagli commented 5 years ago

> Does that mean it is impossible to manually compose a sentence embedding by averaging the embeddings of all the n-grams it is composed of?

You would need to apply the same hashing function to locate the ngram embeddings in the matrix.
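
For reference, sent2vec forks fastText, so the bigram bucket should be computed the same way as in fastText's Dictionary code: an FNV-1a hash of each token, combined with the 116049371 multiplier in 64-bit arithmetic, reduced modulo the number of buckets, and offset by the vocabulary size. A rough Python sketch of that scheme (worth double-checking against the C++ source you built, especially the integer-width details):

```python
def fnv1a_hash(token):
    # FNV-1a as in fastText's Dictionary::hash, with 32-bit arithmetic.
    # fastText XORs each byte after casting it through int8_t, hence the
    # signed adjustment for bytes >= 128.
    h = 2166136261
    for b in token.encode('utf-8'):
        if b >= 128:
            b -= 256
        h ^= b & 0xffffffff
        h = (h * 16777619) & 0xffffffff
    return h

def _int32(v):
    # fastText keeps token hashes in int32_t before widening to uint64_t,
    # so values >= 2**31 wrap negative and sign-extend.
    return v - (1 << 32) if v >= (1 << 31) else v

def bigram_row(w1, w2, nwords, bucket):
    # Mirrors fastText's addWordNgrams for n = 2: combine the two token
    # hashes with the 116049371 multiplier (uint64_t arithmetic), reduce
    # modulo the bucket count, then offset by the vocabulary size to get
    # the row of the bigram's vector in the input matrix.
    h = (_int32(fnv1a_hash(w1)) * 116049371 + _int32(fnv1a_hash(w2))) % (1 << 64)
    return nwords + h % bucket
```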

Asking for the embedding of "i-like" will give you the unigram embedding of "i-like". If you ask for the embedding of "i like", then you'll get the average of the unigram embedding of "i", the unigram embedding of "like", and the bigram embedding of "i like". I guess you could retrieve the bigram embedding of "i like" by subtracting the unigram embeddings from the sent2vec embedding, after first multiplying that embedding by three to undo the averaging.
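
Concretely, that recovery would look something like this (a sketch using the repo's Python wrapper; the path is a placeholder and the model must have been trained with bigrams, i.e. -wordNgrams 2):

```python
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model('model.bin')  # placeholder path

# embed_sentence may return a (1, dim) array; flatten to be safe.
s_i    = model.embed_sentence('i').reshape(-1)
s_like = model.embed_sentence('like').reshape(-1)
s_both = model.embed_sentence('i like').reshape(-1)

# "i like" is the average of three source vectors (unigram "i", unigram
# "like", bigram "i like"), so undo the averaging before subtracting:
bigram_emb = 3 * s_both - s_i - s_like
```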

patrickfrank1 commented 5 years ago

@mpagli thanks for the explanation. I tried to reconstruct the bigram embeddings the way you suggested above (see the same Colab notebook linked earlier). It works as far as I can tell. Best, Patrick