UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

word similarity using SBERT #1173

Open rhoi2021 opened 3 years ago

rhoi2021 commented 3 years ago

Is it possible to use SBERT to calculate the similarity between two words? If so, please explain how. I checked issue https://github.com/UKPLab/sentence-transformers/issues/884, but it is not clear to me how to get a similarity score for two words as input. Thanks in advance.

nreimers commented 3 years ago

Just pass the word and get the embedding back.

manwithtwohats commented 3 years ago

I like to use something like this:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

words = ["food", "pizza", "restaurant"]

# One embedding per word (each word is encoded like a very short sentence)
embeddings = model.encode(words)

# Pairwise cosine similarities, shape (len(words), len(words))
similarity_matrix = cosine_similarity(embeddings, embeddings)

words_sims = list(zip(words, similarity_matrix))
for word, similarities in words_sims:
    print("Word '" + word + "' is similar to... ")
    words_sims_for_this = list(zip(words, similarities))
    for sim_word, sim in words_sims_for_this:
        print("* " + sim_word.rjust(20) + ": " + str(sim))
    print()

This will print something like:

Word 'food' is similar to... 
*                 food: 1.0000001
*                pizza: 0.3648536
*           restaurant: 0.5901749
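
If you prefer to stay within sentence-transformers, its util.cos_sim helper computes the same pairwise cosine-similarity matrix without scikit-learn. A minimal sketch with the same model and words, which should give the same similarity values:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
words = ["food", "pizza", "restaurant"]
embeddings = model.encode(words, convert_to_tensor=True)

# util.cos_sim returns a len(words) x len(words) tensor of pairwise cosine similarities
similarity_matrix = util.cos_sim(embeddings, embeddings)
for word, similarities in zip(words, similarity_matrix):
    print(word, similarities.tolist())
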
rhoi2021 commented 3 years ago

Could you please explain more about how S-BERT can be used to find the similarity between words? I mean, BERT generates contextualized word embeddings, i.e. it provides the most accurate embeddings when a word appears in a sentence (context). So when there is no context, how does it do that? This part is unclear to me. Thanks in advance.

nreimers commented 3 years ago

If you have a longer text and are interested in a word within this context, SBERT can return the token embeddings (set output_value on the encode method to "token_embeddings"). You then need to find the right token embedding for your desired word.
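
For reference, a minimal sketch of what that call looks like (the model name and text are just examples). For inputs shorter than the model's max_seq_length, the returned tensor has one row per token produced by the model's tokenizer, including the special tokens:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')  # example model
text = "this is an example sentence"

# One embedding per token (including special tokens), not one per sentence
token_embeddings = model.encode(text, output_value="token_embeddings")
input_ids = model.tokenizer(text)["input_ids"]
print(token_embeddings.shape[0], len(input_ids))  # the two counts match
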

manwithtwohats commented 3 years ago

Regarding token_embeddings: how do you find the right token embedding? For example:

stuff = "these are words."
token_embeddings = model.encode(stuff, output_value="token_embeddings")
tokens = model.tokenizer(stuff)

Printing tokens and token_embeddings looks something like this:

{'input_ids': [0, 6097, 621, 34153, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
token embeddings:
tensor([[ 0.1160,  0.4853,  2.1967,  ..., -0.2413, -0.1317,  0.2210],
        [ 0.0860,  0.7697,  1.4933,  ..., -0.3778, -0.2841,  0.3136],
        [ 0.1127,  0.6511,  1.6048,  ..., -0.3782, -0.2001,  0.2537],
        [ 0.4265,  0.5819,  1.1807,  ..., -0.3800, -0.1344,  0.0962],
        [ 0.3192,  0.6737,  1.5648,  ..., -0.5888, -0.1932,  0.1575],
        [ 0.1939,  0.6874,  1.6102,  ..., -0.5273, -0.2334,  0.2177]])

So the embedding for the word "these" in this context would be the first one [ 0.1160, 0.4853, 2.1967, ..., -0.2413, -0.1317, 0.2210], the embedding for the next empty space " " would be [ 0.0860, 0.7697, 1.4933, ..., -0.3778, -0.2841, 0.3136], and "words" would be the fifth one [ 0.3192, 0.6737, 1.5648, ..., -0.5888, -0.1932, 0.1575] - is this correct?

Is there an easy way to know which word maps to which token embedding? Can you list a token-to-word mapping or something similar?

nreimers commented 3 years ago

Sadly, it is not that simple. The models apply quite different tokenization strategies. BERT uses word-piece tokenization with a vocabulary of 30k tokens; longer and less frequent words are broken down into multiple word pieces.

You would need to tokenize your input with the model's tokenizer and then count at which position your word starts. Also note that a word can consist of multiple word pieces, leading to multiple output embeddings.
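
To illustrate the word-piece splitting, a minimal sketch (the model name and example words are placeholders; the exact split depends on the model's vocabulary):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')  # BERT-based, word-piece vocabulary

# A frequent word stays a single piece, a rarer word is split into several pieces
print(model.tokenizer.tokenize("coffee"))       # e.g. ['coffee']
print(model.tokenizer.tokenize("coffeehouse"))  # e.g. ['coffee', '##house']
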

manwithtwohats commented 3 years ago

Thanks, I thought I had it, but apparently not... So how do you tokenize and count the position? I tried the following:

model_name = 'sentence-transformers/stsb-xlm-r-multilingual'
model = SentenceTransformer(model_name)
sentence = "that was some good coffee"
tokens = model.tokenizer.tokenize(sentence) 
print(tokens)

outputs: ['▁that', '▁was', '▁some', '▁good', '▁coffee']

but token_embeds = model.encode(sentence, output_value="token_embeddings") returns a tensor with 7 embeddings:

tensor([[-0.6990, -0.1310,  0.9027,  ..., -0.0687, -0.3438, -0.1308],
        [-0.9464,  0.1469, -0.0700,  ..., -0.0828, -0.3840,  0.1284],
        [-0.8786, -0.0124, -0.0140,  ..., -0.0976, -0.3367,  0.0250],
        ...,
        [-0.8716,  0.1066,  0.0599,  ..., -0.1671, -0.3457, -0.0553],
        [-0.3945, -0.1310, -0.0573,  ...,  0.2260, -0.3654,  0.1350],
        [-0.4960, -0.0781, -0.0224,  ...,  0.1932, -0.4126,  0.1121]])

How can I know which embedding belongs to the word "good" when there are 5 words in my sentence but 7 token embeddings?

EDIT: The answer (as per nreimers' reply below) is that the first and last embeddings belong to special tokens, so the fourth word ("good") in the example sentence corresponds to the fifth entry of the token embeddings.

nreimers commented 3 years ago

Transformer networks add a special token to the start and to the end of the input (for BERT these are [CLS] and [SEP]).

Just skip the first embedding; it belongs to the [CLS] token.
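
Putting the thread together, a minimal sketch of looking up one word's contextual embedding. The +1 offset for the leading special token and the averaging over multiple word pieces are assumptions for models that add exactly one special token at the start (as the BERT- and XLM-R-based models discussed here do), not an official API:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/stsb-xlm-r-multilingual')

sentence = "that was some good coffee"
target = "good"

# One embedding per token, including the special tokens at the start and end
token_embeddings = model.encode(sentence, output_value="token_embeddings")

# Tokenize without special tokens to locate the target word's pieces
pieces = model.tokenizer.tokenize(sentence)       # e.g. ['▁that', '▁was', '▁some', '▁good', '▁coffee']
target_pieces = model.tokenizer.tokenize(target)  # e.g. ['▁good']

# Find where the target's pieces start within the sentence's pieces
start = next(i for i in range(len(pieces))
             if pieces[i:i + len(target_pieces)] == target_pieces)

# +1 skips the leading special token; average if the word spans several pieces
word_embedding = token_embeddings[1 + start : 1 + start + len(target_pieces)].mean(dim=0)
print(word_embedding.shape)  # one vector with the model's embedding dimension
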