UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

word embeddings #224

Open shainaraza opened 4 years ago

shainaraza commented 4 years ago

Thanks for providing this information. For word embeddings, I use the following code, as given in one of the examples:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

sentence_embeddings = model.encode(sentences, output_value='token_embeddings')
print("Sentence:", sentences[1])
print(sentence_embeddings[1]) 

It gives me an array of arrays, something like this:

[[-0.7301325  -0.3941787   2.3462217  ...  0.10826997 -0.27091306 -0.08751018]
 [-0.5559533  -0.5525051   2.3269138  ... -0.50415146 -0.54448223 -0.11626344]
 [-0.4228519  -0.3109586   2.918267   ...  0.03306676 -0.6508648   0.09592707]
 [-0.         -0.          0.         ... -0.         -0.         -0.        ]]

For my understanding: what are the subarrays? I also have another question: can I reduce the embedding size to 10, 20 or 50?

Also, if I have a long sequence of words, should I pass the whole sequence as one sentence (which I did), or tokenize each word and pass the words as a list of elements to encode to get word embeddings?

Thank you.

nreimers commented 4 years ago

The first dimension corresponds to the different sentences; the second dimension contains the embeddings for each token / sub-token.
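
A small sketch to make these two dimensions visible (the exact return type of output_value='token_embeddings' can vary between sentence-transformers versions, but each sentence always yields one matrix of per-token vectors):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
sentences = ['Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

# One matrix per sentence: rows are the (sub-)tokens (plus special/padding
# tokens), columns are the 768 hidden dimensions of BERT-base.
token_embeddings = model.encode(sentences, output_value='token_embeddings')

for sentence, embeddings in zip(sentences, token_embeddings):
    print(sentence, '->', embeddings.shape)   # e.g. (num_tokens, 768)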

To reduce the embedding size, you would need a dimensionality reduction technique like PCA or LSA.
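
For example, with scikit-learn's PCA (scikit-learn is not part of sentence-transformers; this is just one possible tool), a sketch could look like this:

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Fit PCA on a reasonably large corpus: n_components cannot exceed
# min(num_sentences, embedding_dim), so 50 components need at least 50 sentences.
corpus = ['This framework generates embeddings for each input sentence',
          'Sentences are passed as a list of string.',
          'The quick brown fox jumps over the lazy dog.']

embeddings = model.encode(corpus)          # shape: (num_sentences, 768)

pca = PCA(n_components=2)                  # use 10, 20 or 50 on a real corpus
reduced = pca.fit_transform(embeddings)
print(reduced.shape)                       # (num_sentences, 2)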

BERT uses contextualized word embeddings, i.e., each word embedding depends on the complete context. If you only pass individual words, you lose this feature of BERT.
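
A sketch contrasting the two ways of calling encode from the question above (pass whole sentences; splitting the text into single words throws the context away):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence = 'The quick brown fox jumps over the lazy dog.'

# Recommended: encode the whole sentence, so every token embedding is
# conditioned on the surrounding words.
contextual = model.encode([sentence], output_value='token_embeddings')[0]

# Not recommended: encoding each word separately discards the context, so
# e.g. "fox" gets the same embedding regardless of the sentence it came from.
isolated = model.encode(sentence.split(), output_value='token_embeddings')

print(contextual.shape)   # one matrix for the full sentence: (num_subtokens, 768)
print(len(isolated))      # one small matrix per isolated word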

shainaraza commented 4 years ago


Superb, thanks!