JoyeBright opened this issue 3 years ago
One of your tensors is on the GPU while the other is on the CPU. Check the devices and ensure both tensors are on the same device.
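(Not from the thread; just a quick way to verify that assumption, using the variable names from the code further down:)

print(query_embedding.device)    # e.g. cuda:0
print(corpus_embeddings.device)  # e.g. cpu -> the mismatch causes the error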
Yeah, that's correct.
Just to remind you, I have two tensors for computing similarity: (query_embedding, corpus_embeddings). According to your description, I reckon the problem concerns corpus_embeddings, because when I added device='cpu' for query_embedding it worked perfectly. In other words, corpus_embeddings is on the CPU! I don't know why!
Regarding this: because my corpus is very large (around 31M sentences), I encoded it (corpus_embeddings) on the GPU separately, then saved and loaded the embeddings using pickle.
I was just wondering if it's possible to put the corpus embedding tensors on the GPU without re-encoding them?
This is all I'm doing:
from sentence_transformers import SentenceTransformer, util
import torch
import pickle

# Load ID sentences & embeddings from disk
with open('ID_embeddings_128dim.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    ID_sentences = stored_data['sentences']
    ID_embeddings = stored_data['embeddings']

# Load OOD sentences & embeddings from disk
with open('OOD_NOShuffle_all_128dim.pkl', "rb") as fIn:
    stored_data2 = pickle.load(fIn)
    OOD_sentences = stored_data2['sentences']
    OOD_embeddings = stored_data2['embeddings']

embedder = SentenceTransformer('stsb-xlm-r-multilingual-128dim', device='cuda')

queries = ID_sentences
corpus_embeddings = OOD_embeddings
corpus = OOD_sentences

i = 0
data = []  # Just to save the results

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine similarity and torch.topk to find the highest 5 scores
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    data.append("\n\n======================\n\n")
    print("Query " + str(i) + ": " + query)
    i = i + 1
    data.append("Query: " + str(query))
    print("\nTop 5 most similar sentences in corpus:")
    data.append("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        data.append(str(corpus[idx]) + " (Score: {:.4f})".format(score))
Yes, when you load them from pickle they are on the CPU. You need to move them to the GPU (if the GPU has enough memory):
Corp_emb = torch.tensor(data_from_pickle, device="cuda")
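(Side note, not from the thread: if the pickled 'embeddings' entry is already a torch.Tensor, as produced by encode(..., convert_to_tensor=True), calling torch.tensor() on it triggers a copy-construct warning; .to('cuda') moves it directly. A minimal sketch, assuming corpus_embeddings is the object loaded from the pickle file:)

import torch

# Hypothetical name: corpus_embeddings is the loaded 'embeddings' entry
if torch.is_tensor(corpus_embeddings):
    corpus_embeddings = corpus_embeddings.to("cuda")                    # already a tensor: just move it
else:
    corpus_embeddings = torch.tensor(corpus_embeddings, device="cuda")  # e.g. a list or numpy array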
Thanks for your prompt response.
Is it possible to divide them and move them to multiple GPUs? I have three 16 GB GPUs. When I use torch.tensor(data_from_pickle, device="cuda"), it only loads the data onto one of them (the first one), which is not enough.
I'm asking because for encoding I used encode_multi_process without any problem.
Yes. The simplest solution would be to split your corpus embeddings into two equally large tensors and move them to cuda:0, cuda:1, cuda:2.
Your query embedding must be moved to all 3 GPUs, and on each GPU you must run the cosine-similarity computation + topk.
Not sure whether this can be parallelized in an easy way (I have never done this with pytorch).
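(Not part of the original answer; a rough sketch of what that sharded search could look like, using torch.chunk to split the corpus and keeping track of each shard's offset so that shard-local topk indices can be mapped back to corpus indices:)

import torch
from sentence_transformers import util

devices = ["cuda:0", "cuda:1", "cuda:2"]

# Split the corpus embeddings into one shard per GPU and remember each shard's start offset
shards, offsets, offset = [], [], 0
for i, chunk in enumerate(torch.chunk(corpus_embeddings, len(devices))):
    shards.append(chunk.to(devices[i]))
    offsets.append(offset)
    offset += chunk.size(0)

def shard_search(query_embedding, top_k=5):
    all_scores, all_ids = [], []
    for shard, device, off in zip(shards, devices, offsets):
        q = query_embedding.to(device)
        cos_scores = util.pytorch_cos_sim(q, shard)[0]
        scores, ids = torch.topk(cos_scores, k=min(top_k, shard.size(0)))
        all_scores.append(scores.cpu())
        all_ids.append(ids.cpu() + off)   # map shard-local indices back to global corpus indices
    # Merge the per-GPU candidates and keep the overall top_k
    scores, ids = torch.cat(all_scores), torch.cat(all_ids)
    best = torch.topk(scores, k=min(top_k, scores.size(0)))
    return best.values, ids[best.indices]

The per-shard searches in this sketch still run sequentially; true parallelism would need CUDA streams or one process per GPU, as noted above.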
@nreimers Thanks for your reply.
Did you mean splitting the 31M embedding vectors into two equal parts and moving them to cuda:0 and cuda:1? If so, what should be moved onto cuda:2? I.e., nothing would remain for cuda:2 if the corpus is split into two equal parts.
Sorry, into 3 equal sets, so that ~10M vectors are on each GPU.
Unfortunately, it does not work, because each GPU has only 16 GB of memory! Any other solution?
In the meantime, can you confirm that hnswlib works only on the CPU?
For your information, I got this error: RuntimeError: CUDA out of memory. Tried to allocate 4.97 GiB (GPU 0; 15.90 GiB total capacity; 6.08 GiB already allocated; 4.00 GiB free; 11.07 GiB reserved in total by PyTorch)
10M embeddings with 768 dimensions in float32 require about 30 GB of memory (10M × 768 × 4 bytes). With fp16 it would be about 15 GB, but then you have no memory left for the computation.
You can try to minimize the embedding size (see our docs).
Or use approximate nearest neighbor (ANN) search with faiss or hnswlib.
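(Not from the thread: for the embedding-size route, one way to shrink the already-computed embeddings without re-encoding 31M sentences is to fit a PCA on a subsample and project everything down, roughly along these lines; the 128-dimension target, the subsample size, and the chunking are arbitrary choices here:)

import numpy as np
from sklearn.decomposition import PCA

emb = corpus_embeddings.cpu().numpy().astype(np.float32)   # assumed shape: (n_corpus, original_dim)

# Fit PCA on a random subsample to keep fitting cheap, then project the full corpus in chunks
pca = PCA(n_components=128)
sample = emb[np.random.choice(emb.shape[0], size=100_000, replace=False)]
pca.fit(sample)

reduced = np.concatenate(
    [pca.transform(chunk).astype(np.float16) for chunk in np.array_split(emb, 100)]
)
# Query embeddings must be projected with the same fitted PCA before computing cosine similarity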
Yeah, I'm going to minimize the embedding size.
I've already tried hnswlib, but it used the CPU, so I cannot benefit from the GPUs. Am I right?
You can use faiss with GPU support.
But when you cannot store all your embeddings on the GPU, the benefit is limited: moving data to the GPU is quite slow, so it is not worth moving fractions of the data to the GPU just for the computation.
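(Again not from the thread, just a sketch of what a GPU faiss setup could look like here, assuming the embeddings fit once sharded over the three cards; an exact IndexFlatIP over L2-normalized vectors is equivalent to cosine similarity:)

import faiss
import numpy as np

xb = corpus_embeddings.cpu().numpy().astype(np.float32)   # corpus vectors
xq = query_embeddings.cpu().numpy().astype(np.float32)    # query vectors
faiss.normalize_L2(xb)   # after L2 normalization, inner product == cosine similarity
faiss.normalize_L2(xq)

cpu_index = faiss.IndexFlatIP(xb.shape[1])    # exact inner-product search

# Shard the index across all visible GPUs instead of replicating it on each one
co = faiss.GpuMultipleClonerOptions()
co.shard = True
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index, co=co)

gpu_index.add(xb)
scores, ids = gpu_index.search(xq, 5)         # top-5 corpus indices per query

If even the sharded flat index does not fit, a compressed index type such as IVFPQ can cut the memory footprint further, at some cost in recall.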
I came up with an idea. What do you think of it?
32D might be a bit too little, see: https://arxiv.org/abs/2012.14210
Otherwise sounds good.
You could also use GPU0 for your model and GPU1/2 to store the corpus embeddings.
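(A compact sketch of that layout, assuming the shard-search helper from the earlier multi-GPU sketch and the corpus_embeddings/query names used above; the model and query encoding stay on cuda:0 while the corpus lives on cuda:1 and cuda:2:)

from sentence_transformers import SentenceTransformer

# Model (and query encoding) on cuda:0; corpus embeddings split over cuda:1 and cuda:2
embedder = SentenceTransformer('stsb-xlm-r-multilingual-128dim', device='cuda:0')

half = corpus_embeddings.size(0) // 2
shards = [corpus_embeddings[:half].to('cuda:1'), corpus_embeddings[half:].to('cuda:2')]

query_embedding = embedder.encode(query, convert_to_tensor=True)   # lands on cuda:0
# ...then score each shard and merge the top-k results as in the earlier sketch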
Hi there,
I want to perform semantic search via cosine similarity, and to do so I have prepared the following datasets:
Although I could run the same code on Google Colab (with a different embedding size: 768), pytorch_cos_sim got stuck and threw the following error on the server:
I was wondering if you could elaborate a bit more on how to debug this error, please?
Let me just add that, due to the lack of memory, I applied PCA for dimensionality reduction.
Regards, Javad