UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Cosine similarity between sentences #306

Open sumit11112 opened 4 years ago

sumit11112 commented 4 years ago

Hi,

The code below throws an error at cdist(a1, a2, 'cosine')[0][0]

How can I measure the cosine similarity between sentences? I cannot provide all the strings in a single array in one shot.

from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, losses
import numpy as np

model = SentenceTransformer('distiluse-base-multilingual-cased')

sentences = ['This framework generates embeddings for each input sentence']
sentence_embeddings = model.encode(sentences)
a1 = 1
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    a1 = embedding
    print("")

sentences = ['Sentences are passed as a list of string.']
sentence_embeddings = model.encode(sentences)
a2 = 1
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    a2 = embedding
    print("")

import scipy.spatial
from scipy.spatial.distance import cdist

yy = scipy.spatial.distance.cdist(a1, a2, "cosine")[0]
print(yy)

Y = cdist(a1, a2, 'cosine')[0][0]
print(Y)

nreimers commented 4 years ago

cdist expects a two-dimensional array as input. Changing the code like this:

yy = scipy.spatial.distance.cdist([a1], [a2], "cosine")[0]

works for me.
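
For anyone hitting the same error, here is a minimal sketch of the shape requirement. The vectors a1 and a2 are made-up stand-ins (an assumption, so the snippet runs without downloading a model); real embeddings would come from model.encode(...). It also shows that SciPy's "cosine" metric is a distance, so the similarity is 1 minus that value.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-in vectors (assumption): real embeddings would come from model.encode(...).
a1 = np.array([1.0, 0.0, 1.0])
a2 = np.array([1.0, 1.0, 0.0])

# cdist needs 2-D inputs of shape (n_vectors, dim), so add a leading axis.
# reshape(1, -1) has the same effect as wrapping the vector in a list: [a1].
distance = cdist(a1.reshape(1, -1), a2.reshape(1, -1), "cosine")[0][0]

# SciPy's "cosine" metric is a distance, i.e. 1 - cosine similarity.
similarity = 1.0 - distance

# Cross-check against the textbook formula: dot(a, b) / (|a| * |b|).
manual = float(np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2)))

print("distance:", distance)
print("similarity:", similarity)
print("manual similarity:", manual)
```

Passing the 1-D vectors directly, as in the original post, is what triggers the error; either [a1] or a1.reshape(1, -1) gives cdist the 2-D input it expects.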

sumit11112 commented 4 years ago

Hi,

This might look like a stupid question, but just to confirm: can I persist the embedding arrays a1 and a2 above? Later, if needed, I would load them back as arrays and run yy = scipy.spatial.distance.cdist([a1], [a2], "cosine")[0] again. That will work, right?

Also, since I am using the pre-trained 'distiluse-base-multilingual-cased' model, will the embedding of a French sentence be close to that of its English version?

nreimers commented 4 years ago

Yes, you can persist the embeddings on disc and load them later. You can use pickle or numpy save/load functions.
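
A rough sketch of both options, assuming small stand-in vectors in place of real embeddings. The file names embeddings.npy and embeddings.pkl, the temporary directory, and the placeholder sentences are illustrative choices, not anything from the library.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in data (assumption): real vectors would come from model.encode(...).
sentences = ["first sentence", "second sentence"]
a1 = np.array([0.1, 0.2, 0.3], dtype=np.float32)
a2 = np.array([0.3, 0.2, 0.1], dtype=np.float32)
embeddings = np.stack([a1, a2])  # shape (2, 3): one row per sentence

# Illustrative location; any writable directory works.
tmp_dir = tempfile.mkdtemp()

# Option 1: NumPy's binary format, which stores the array only.
npy_path = os.path.join(tmp_dir, "embeddings.npy")
np.save(npy_path, embeddings)
loaded = np.load(npy_path)

# Option 2: pickle, which can keep the sentences next to their vectors.
pkl_path = os.path.join(tmp_dir, "embeddings.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump({"sentences": sentences, "embeddings": embeddings}, f)
with open(pkl_path, "rb") as f:
    restored = pickle.load(f)

print(loaded.shape)
print(restored["sentences"])
```

Each row of the loaded array is a 1-D vector again, so it still has to be wrapped as [loaded[0]] and [loaded[1]] before being passed to cdist. Only unpickle files that you created yourself, since pickle can execute arbitrary code on load.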