UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity #852

Open JoyeBright opened 3 years ago

JoyeBright commented 3 years ago

Hi there,

I want to perform semantic search based on cosine similarity, and to do so I have prepared the following data:

Queries:           <class 'list'>           179435
Corpus embeddings: <class 'numpy.ndarray'>  (31257735, 128)
Corpus:            <class 'list'>           31257735

Although I could run the same code on Google Colab (with a different embedding size, 768), pytorch_cos_sim got stuck and threw the following error on the server:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-18-9f8d1c8ab6d4> in <module>
      5 
      6     # We use cosine-similarity and torch.topk to find the highest 5 scores
----> 7     cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
      8     top_results = torch.topk(cos_scores, k=top_k)
      9 

~/anaconda3/envs/method2/lib/python3.8/site-packages/sentence_transformers/util.py in pytorch_cos_sim(a, b)
     19     :return: Matrix with res[i][j]  = cos_sim(a[i], b[j])
     20     """
---> 21     return cos_sim(a, b)
     22 
     23 def cos_sim(a: Tensor, b: Tensor):

~/anaconda3/envs/method2/lib/python3.8/site-packages/sentence_transformers/util.py in cos_sim(a, b)
     40     a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
     41     b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
---> 42     return torch.mm(a_norm, b_norm.transpose(0, 1))
     43 
     44 

RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)

I was wondering if you could elaborate more on how to debug the error, please?

Let me just add that due to the lack of memory, I employed PCA for dimensionality reduction.

Regards, Javad

nreimers commented 3 years ago

One of your tensors is on the GPU while the other is on the CPU. Check the devices and ensure both tensors are on the same device.
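
For reference, a minimal sketch of that kind of device check, assuming both objects are already torch tensors (a pickled numpy array would first need torch.from_numpy):

# Inspect where each tensor currently lives
print(query_embedding.device)    # e.g. cuda:0
print(corpus_embeddings.device)  # e.g. cpu

# Move the corpus embeddings onto the same device as the query embedding
corpus_embeddings = corpus_embeddings.to(query_embedding.device)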

JoyeBright commented 3 years ago

Yeah, that's correct.

Just to remind you, I have two tensors for computing the similarity: (query_embedding, corpus_embeddings). From your description, I reckon the problem is with corpus_embeddings, because when I set device='cpu' for query_embedding it worked perfectly. In other words, corpus_embeddings is on the CPU! I don't know why!

For context: because my corpus is very large (around 31M sentences), I encoded it (corpus_embeddings) on the GPU separately, then saved and later loaded it using pickle.

I was just wondering if it is possible to put the corpus embedding tensors on the GPU without re-encoding them?

This is all I'm doing:


from sentence_transformers import SentenceTransformer, util
import torch
import pickle
#Load ID sentences & embeddings from disk
with open('ID_embeddings_128dim.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    ID_sentences = stored_data['sentences']
    ID_embeddings = stored_data['embeddings']
#Load OOD sentences & embeddings from disk
with open('OOD_NOShuffle_all_128dim.pkl', "rb") as fIn:
    stored_data2 = pickle.load(fIn)
    OOD_sentences = stored_data2['sentences']
    OOD_embeddings = stored_data2['embeddings']

embedder = SentenceTransformer('stsb-xlm-r-multilingual-128dim', device='cuda')
queries = ID_sentences
corpus_embeddings = OOD_embeddings
corpus = OOD_sentences

i = 0
data = [] # Just to save the result
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n=====================\n\n")
    data.append("\n\n======================\n\n")
    print("Query "+ str(i) + ":" + query)
    i = i+1
    data.append("Query:" + str(query))
    print("\nTop 5 most similar sentences in corpus:")
    data.append("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        data.append(str(corpus[idx]) + "(Score: {:.4f})".format(score))

nreimers commented 3 years ago

Yes, when you load them from pickle they are on the CPU. You need to move them to the GPU (if the GPU has enough memory):

Corp_emb = torch.tensor(data_from_pickle, device="cuda")
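
(Side note: model.encode returns a numpy array by default, so if that is what ended up in the pickle, torch.from_numpy(data_from_pickle).to("cuda") should also work and avoids the extra CPU-side copy that torch.tensor makes.)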

JoyeBright commented 3 years ago

Thanks for your prompt response.

Is it possible to split them and move them to multiple GPUs? I have three 16 GB GPUs. When I use torch.tensor(data_from_pickle, device="cuda") it only loads the data onto one of them (the first one), which is not enough.

I'm asking because for encoding I used encode_multi_process without any problem.

nreimers commented 3 years ago

Yes. The simplest solution would be to split your corpus embeddings into two equally large tensors and move them to cuda:0, cuda:1, cuda:2.

Your query embedding must be moved to all 3 GPUs, and on each GPU you must run the cosine similarity computation + topk.

Not sure if this can be parallelized in an easy way (I never did this with PyTorch).
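
A minimal sketch of what this could look like, reusing the variables from the script above and assuming corpus_embeddings is a float32 numpy array (the per-shard similarity computations below run one after another, i.e. not in parallel):

import torch
from sentence_transformers import util

num_gpus = 3
top_k = 5

# Split the corpus embeddings into one shard per GPU and move each shard over
shards = [chunk.to(f"cuda:{i}")
          for i, chunk in enumerate(torch.from_numpy(corpus_embeddings).chunk(num_gpus))]

# Remember where each shard starts, so per-shard indices map back to corpus indices
offsets = [0]
for shard in shards[:-1]:
    offsets.append(offsets[-1] + shard.shape[0])

query_embedding = embedder.encode(query, convert_to_tensor=True)

# Cosine similarity + topk on every GPU, then merge into a global top-k
all_scores, all_indices = [], []
for offset, shard in zip(offsets, shards):
    cos_scores = util.pytorch_cos_sim(query_embedding.to(shard.device), shard)[0]
    scores, indices = torch.topk(cos_scores, k=top_k)
    all_scores.append(scores.cpu())
    all_indices.append(indices.cpu() + offset)

best = torch.topk(torch.cat(all_scores), k=top_k)
merged_indices = torch.cat(all_indices)
for score, idx in zip(best.values, merged_indices[best.indices]):
    print(corpus[idx], "(Score: {:.4f})".format(score))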

JoyeBright commented 3 years ago

@nreimers Thanks for your reply.

Did you mean splitting the 31M embedding vectors into two equal parts and moving them to cuda:0 and cuda:1? If yes, what should be moved onto cuda:2? I.e., nothing will remain for cuda:2 if the corpus is split into two equal parts.

nreimers commented 3 years ago

Sorry, into 3 equal sets, so that ~10M vectors end up on each GPU.

JoyeBright commented 3 years ago

Unfortunately, it does not work because each GPU has only 16 GB of RAM! Any other solution?

In the meantime, can you confirm that hnswlib works only on the CPU?

JoyeBright commented 3 years ago

For your information, I got this error: RuntimeError: CUDA out of memory. Tried to allocate 4.97 GiB (GPU 0; 15.90 GiB total capacity; 6.08 GiB already allocated; 4.00 GiB free; 11.07 GiB reserved in total by PyTorch)

nreimers commented 3 years ago

10M embeddings with 768 dimensions in float32 require about 30 GB of memory. With fp16 it would be 15 GB, but then you have no memory left for the computation.
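
(For reference: 10,000,000 × 768 × 4 bytes ≈ 30.7 GB, and with 2 bytes per value in fp16 it is ≈ 15.4 GB.)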

You can try to minimize the embedding size (see our docs).

Or use ANN (approximate nearest neighbor) search with faiss or hnswlib.
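
For the hnswlib route, a minimal sketch of an approximate index over the corpus embeddings (the parameter values are only typical starting points, not tuned; building the index for 31M vectors takes a while and needs enough CPU RAM):

import hnswlib

dim = corpus_embeddings.shape[1]  # e.g. 128

# 'cosine' space returns cosine distance = 1 - cosine similarity
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=corpus_embeddings.shape[0], ef_construction=200, M=16)
index.add_items(corpus_embeddings)
index.set_ef(50)  # query-time accuracy/speed trade-off, must be >= k

query_embeddings = embedder.encode(queries, convert_to_numpy=True)
labels, distances = index.knn_query(query_embeddings, k=5)  # labels are corpus indices
for idx, dist in zip(labels[0], distances[0]):
    print(corpus[idx], "(Score: {:.4f})".format(1 - dist))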

JoyeBright commented 3 years ago

Yeah, gonna minimize the embedding size.

I've already tried hnswlib, but it runs on the CPU. That is, I cannot benefit from the GPU. Am I right?

nreimers commented 3 years ago

You can use faiss with GPU.

But when you cannot store your embeddings on the GPU, support is limited. Moving data to the GPU is quite slow, so it is not worth moving fractions of the corpus to the GPU just for the computation.
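
If you try faiss on GPU, a rough sketch with a flat (exact) inner-product index could look like the following. This assumes the faiss-gpu build; cosine similarity is obtained by L2-normalizing the vectors first, and since 31M × 128 float32 vectors are ≈ 16 GB, the index is sharded across the GPUs rather than replicated:

import faiss
import numpy as np

d = corpus_embeddings.shape[1]  # e.g. 128
corpus_vecs = np.ascontiguousarray(corpus_embeddings, dtype="float32")
faiss.normalize_L2(corpus_vecs)  # in place, so inner product == cosine similarity

cpu_index = faiss.IndexFlatIP(d)

# Spread the index over all available GPUs (shard=True splits it instead of copying it to each GPU)
co = faiss.GpuMultipleClonerOptions()
co.shard = True
index = faiss.index_cpu_to_all_gpus(cpu_index, co=co)
index.add(corpus_vecs)

query_vecs = embedder.encode(queries, convert_to_numpy=True)
faiss.normalize_L2(query_vecs)
scores, ids = index.search(query_vecs, 5)  # cosine similarities and corpus indices per query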

JoyeBright commented 3 years ago

I came up with an idea. What do you think of it? (A quick memory estimate follows the list.)

  1. Minimize the corpus embedding size, for example to 32D.
  2. Split the 31M x 32D corpus into 3 equal parts and move one part onto each GPU.
  3. Move the queries to all three GPUs.
  4. Compute the cosine similarity between the queries and the corpus part on each GPU separately (not in parallel).
  5. Save the step 4 results (the top n sentences from the comparison).
  6. Merge the step 5 results.
  7. Repeat step 4 for n iterations.
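
For reference, the memory footprint of step 2: 31,257,735 vectors × 32 dimensions × 4 bytes ≈ 4.0 GB in total, i.e. roughly 1.3 GB per GPU; even at 128 dimensions it would be ≈ 16 GB in total, about 5.3 GB per GPU, which should still leave room for the similarity computation on 16 GB cards.
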
nreimers commented 3 years ago

32D might be a bit too little, see: https://arxiv.org/abs/2012.14210

Otherwise sounds good.

You could also use GPU0 for your model and GPU1/2 to store the corpus embeddings.
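
In code, that device assignment could look roughly like this (a shorter variant of the sharding sketch further up, with the model on cuda:0 and two corpus shards instead of three):

import torch
from sentence_transformers import SentenceTransformer

# Model on cuda:0, corpus shards on cuda:1 and cuda:2
embedder = SentenceTransformer('stsb-xlm-r-multilingual-128dim', device='cuda:0')
shards = [chunk.to(f"cuda:{i + 1}")
          for i, chunk in enumerate(torch.from_numpy(corpus_embeddings).chunk(2))]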