UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Tips to do batch scoring for semantic text similarity #1138

Open user8114 opened 3 years ago

user8114 commented 3 years ago

Hello,

This snippet shows how to compare sentences1 vs. sentences2 with util.pytorch_cos_sim().

Let's say I have two lists of 100 books each.

I could do this --

# books1 is a list of 100 books, each a list of sentences
# books2 is a list of 100 books, each a list of sentences

for i in range(len(books1)):
    sentences1 = books1[i]
    sentences2 = books2[i]
    embeddings1 = model.encode(sentences1, convert_to_tensor=True)
    embeddings2 = model.encode(sentences2, convert_to_tensor=True)

    # Compute cosine similarities (full matrix: len(sentences1) x len(sentences2))
    cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

    # Output each aligned pair (sentences1[j] vs. sentences2[j]) with its score
    for j in range(len(sentences1)):
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[j], sentences2[j], cosine_scores[j][j]))

Any advice on a faster approach that avoids the first for-loop?

nreimers commented 3 years ago

How many sentences are there per book?

user8114 commented 3 years ago

100-400 sentences per book. Average around 180.

util.pytorch_cos_sim(embeddings1, embeddings2) has always worked fine (I have never run into memory limits). I am just trying to think of ways to speed this up.

nreimers commented 3 years ago

The first for-loop does not matter for the runtime; I would keep it as it is.
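
For comparison, here is a minimal sketch of the fully batched variant (not from this reply): every sentence from both lists is encoded in a single model.encode call per list, and the stacked embeddings are sliced back per book. The model name, batch_size value, and toy data are placeholders chosen for illustration; model.encode already batches internally, which is why dropping the outer Python loop gains little.

from sentence_transformers import SentenceTransformer, util

# Placeholder model and toy data; in the thread, books1/books2 hold 100 books
# with roughly 100-400 sentences each.
model = SentenceTransformer("all-MiniLM-L6-v2")
books1 = [["The cat sits outside.", "A man is playing guitar."]]
books2 = [["The cat plays in the garden.", "A woman watches TV."]]

# Encode all sentences of each list in one call (one large batch per list)
flat1 = [sentence for book in books1 for sentence in book]
flat2 = [sentence for book in books2 for sentence in book]
emb1 = model.encode(flat1, convert_to_tensor=True, batch_size=64)
emb2 = model.encode(flat2, convert_to_tensor=True, batch_size=64)

# Slice the stacked embeddings back into per-book blocks and score each book pair
start1 = start2 = 0
for sentences1, sentences2 in zip(books1, books2):
    e1 = emb1[start1:start1 + len(sentences1)]
    e2 = emb2[start2:start2 + len(sentences2)]
    start1 += len(sentences1)
    start2 += len(sentences2)

    # Full similarity matrix for this book pair; the diagonal holds the aligned pairs
    cosine_scores = util.pytorch_cos_sim(e1, e2)
    for j in range(len(sentences1)):
        print("{} \t {} \t Score: {:.4f}".format(sentences1[j], sentences2[j], cosine_scores[j][j]))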

Otherwise, note that there is a method for pairwise cosine similarity computation (not yet released on pip): https://github.com/UKPLab/sentence-transformers/blob/b5a85a826faeab9ad781eeba700308e7913c9700/sentence_transformers/util.py#L79

pytorch_cos_sim computes all combinations, while pairwise_cos_sim computes only cos_sim(sent1[i], sent2[i]).
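
A small sketch of the difference, assuming an aligned pair of sentence lists and a placeholder model name; since pairwise_cos_sim was not yet released on pip at the time, the diagonal of the full matrix is shown as the equivalent result.

import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name

sentences1 = ["The cat sits outside.", "A man is playing guitar."]
sentences2 = ["The cat plays in the garden.", "A woman watches TV."]

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# pytorch_cos_sim: full matrix, entry [i][j] = cos_sim(sentences1[i], sentences2[j])
full = util.pytorch_cos_sim(emb1, emb2)   # shape (2, 2)

# pairwise_cos_sim would return only cos_sim(sentences1[i], sentences2[i]),
# i.e. the diagonal of the matrix above
pairwise = torch.diagonal(full)           # shape (2,)

print(full)
print(pairwise)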