martinritchie opened this issue 4 years ago
How do you call the encode method? I.e., which parameters do you pass? In particular, how do you set convert_to_numpy or convert_to_tensor?
Thank you for a quick response!
Here is an example:
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
y = model.encode(text_list,
                 batch_size=256,
                 show_progress_bar=True,
                 is_pretokenized=False)
So I see that I am using the default values convert_to_tensor=False and convert_to_numpy=True, where text_list contains millions of sentences. Shouldn't the embeddings be detached during the for loop defined on line 168?
A quick solution would be to break text_list down into smaller chunks (e.g. only 100k sentences) and to append the embeddings afterwards, instead of passing millions of sentences at once.
I have to think about whether you can detach at line 168 and what implications this would have if you, e.g., want the tensors for some downstream application (e.g. as input to some other PyTorch model).
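A minimal sketch of that chunking workaround, assuming the model and the text_list from the example above; the 100k chunk size is arbitrary:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# Encode text_list (assumed to hold millions of sentences) in chunks so that
# only one chunk's worth of embeddings is produced per call, then stitch the
# numpy arrays back together.
chunk_size = 100_000
chunks = []
for start in range(0, len(text_list), chunk_size):
    embeddings = model.encode(text_list[start:start + chunk_size],
                              batch_size=256,
                              show_progress_bar=True,
                              convert_to_numpy=True)
    chunks.append(embeddings)

all_embeddings = np.concatenate(chunks, axis=0)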
I agree that in cases where the embeddings are immediately used in downstream, GPU-based tasks the current approach makes sense, i.e. when convert_to_tensor=False. In my use case, where I needed to process a large dataset and store the sentence embeddings, it was not immediately clear why I encountered CUDA OOMs. So, when convert_to_numpy=True, would one be safe to detach the tensors before extending all_embeddings with them?
To give you some more context: I was using a different repo that builds on this one, and when the original OOM was encountered I was a little perplexed. I think a warning, or changing when the tensors are detached as described above, could be a nice way to prevent other users from encountering the same problem.
Agreed. When convert_to_numpy=True, I will change the code so that detach() and cpu() happen in the loop, not afterwards.
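Roughly what that change amounts to, as a simplified sketch rather than the actual library code; tokenize() and the forward call below are stand-ins for whatever encode() does internally:

import torch

def encode_with_per_batch_detach(model, sentences, batch_size=32,
                                 convert_to_numpy=True, device="cuda"):
    # Simplified sketch of the batching loop inside encode(); not the real
    # implementation, just where the detach()/cpu() calls move to.
    all_embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        features = model.tokenize(batch)
        features = {k: v.to(device) for k, v in features.items()}
        with torch.no_grad():
            embeddings = model(features)["sentence_embedding"]
        if convert_to_numpy:
            # Detach and move to CPU per batch, so each batch's GPU copy is
            # freed instead of accumulating until the whole loop finishes.
            embeddings = embeddings.detach().cpu()
        all_embeddings.append(embeddings)
    all_embeddings = torch.cat(all_embeddings)
    return all_embeddings.numpy() if convert_to_numpy else all_embeddings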
Thank you for such a prompt reply and fix, I really appreciate it.
@nreimers Do you have an ETA on when this will be applied? I am currently having the issue that EmbeddingSimilarityEvaluator consistently runs into "CUDA out of memory" errors, because the memory usage keeps increasing until it hits the GPU limit (16 GB) and then crashes.
The model trains successfully with a given batch size but then fails to evaluate. When monitoring GPU usage, it seems that it successfully encodes batches but does not clear them from memory as normally happens during training.
As shown in the pictures, memory usage keeps increasing until it runs out of memory and crashes the main training process. I think this is related to this .encode issue.
Full traceback showing that it breaks in the evaluator, specifically on self.forward(features):
If you install this repo in editable mode (pip install -e .) and make the corrections I added in the OP, you can work around this problem until it is patched.
In the latest release, I added detach() in the main loop of encode.
Does it work better now?
Hi @nreimers,
Upon deeper inspection of my code, I was selecting a file for the evaluator that was much, much bigger than the one I originally intended. The resulting problem is that the evaluator (EmbeddingSimilarityEvaluator) stores all the embeddings in memory. So while batching limits the amount passed to the GPU (which is cleared properly), if you have a test file with a large number of entries you are going to hit memory limits, because the evaluator does not actually process the embeddings in batches when computing the correlations (which I think would save a lot of space).
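As an illustration only, a sketch of what a more memory-frugal evaluation could look like: encode the paired sentences chunk by chunk, keep only the per-pair scalar scores, and compute the correlation at the end. The function and argument names are made up; this is not the library's evaluator.

import torch
from scipy.stats import spearmanr

def chunked_similarity_eval(model, sentences1, sentences2, gold_scores,
                            chunk_size=10_000, batch_size=64):
    # sentences1/sentences2 are aligned lists of sentence pairs and
    # gold_scores their human similarity labels, as in
    # EmbeddingSimilarityEvaluator.
    cos_scores = []
    for start in range(0, len(sentences1), chunk_size):
        emb1 = model.encode(sentences1[start:start + chunk_size],
                            batch_size=batch_size, convert_to_tensor=True)
        emb2 = model.encode(sentences2[start:start + chunk_size],
                            batch_size=batch_size, convert_to_tensor=True)
        # Keep only the per-pair scalar scores; the embedding chunks are
        # discarded as soon as this iteration ends.
        scores = torch.nn.functional.cosine_similarity(emb1, emb2, dim=1)
        cos_scores.extend(scores.cpu().tolist())
        del emb1, emb2
    spearman, _ = spearmanr(gold_scores, cos_scores)
    return spearman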
Hi @nreimers Thank you for this trick.
Please, I was wondering if it is the optimal solution for OOM errors when trying to compute cosine similarities. In my use case, I am trying to compute the cosine similarities of a list containing 50,716 company names and I got an OOM error.
Later, I came across the Paraphrase Mining task and ran it without any issue. I might be wrong, but it seems that Paraphrase Mining works with text instead of embeddings as input. Please, are there any differences in the output similarity results when using Paraphrase Mining versus Semantic Textual Similarity?
Thank you
Hi @selfcontrol7
Computing pairwise cosine similarity between all pairs has a quadratic memory requirement. If you run out of memory, you can try to perform it on CPU.
If you still don't have enough memory, it helps to chunk the work into smaller parts. This is what the paraphrase mining code does: instead of computing one 50k x 50k cosine similarity matrix, it computes e.g. five 50k x 10k matrices, one after the other.
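A sketch of that chunking idea using util.cos_sim (called util.pytorch_cos_sim in older releases); embeddings is assumed to be the (N, dim) tensor returned by model.encode(names, convert_to_tensor=True), moved to CPU first if even a single slice does not fit on the GPU:

import torch
from sentence_transformers import util

def top_pairs_chunked(embeddings, chunk_size=10_000, top_k=5):
    # Score one chunk of queries against the full corpus at a time, so only
    # a (chunk_size x N) matrix exists instead of the full (N x N) one.
    hits = []
    for start in range(0, len(embeddings), chunk_size):
        query_chunk = embeddings[start:start + chunk_size]
        scores = util.cos_sim(query_chunk, embeddings)
        values, indices = torch.topk(scores, k=top_k, dim=1)
        for row, (vals, idxs) in enumerate(zip(values, indices)):
            query_id = start + row
            hits.extend((query_id, int(j), float(v))
                        for v, j in zip(vals, idxs) if int(j) != query_id)
    return hits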
Hi,
Thank you for your prompt reply and your explanation, I appreciate it.
Please, can you guide me on how I can perform the computation on the CPU instead of the GPU? Also, if I understand correctly, would paraphrase mining be the best approach in my use case? Would the accuracy of the cosine similarity computation be different if I chunked my list into smaller parts myself and used the semantic textual similarity method instead?
Sorry if my questions seem so basic, I am a novice in sentence similarity.
Thank you again.
Hello, I dug a bit more and managed to make it work on the CPU, which took a lot of time to compute the similarity. After a couple of tests with paraphrase mining and semantic search, I think paraphrase mining worked better.
However, is there any way to set a minimum similarity threshold when running Semantic Textual Similarity, Paraphrase Mining, or Semantic Search? For instance, I would like the output pairs to exclude similarities below 40% when running Semantic Search.
Thank you again for your time and your work.
Hi, you can sort the results in decreasing order and stop once the score is below your threshold.
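For example, with util.semantic_search, which already returns hits sorted by decreasing score; company_names stands in for the list of 50,716 names and 0.4 for the desired cut-off:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
embeddings = model.encode(company_names, convert_to_tensor=True)

hits_per_query = util.semantic_search(embeddings, embeddings, top_k=10)
for query_id, hits in enumerate(hits_per_query):
    # Hits arrive sorted by decreasing score, so we can stop at the first
    # one that falls below the threshold.
    for hit in hits:
        if hit["score"] < 0.4:
            break
        if hit["corpus_id"] != query_id:
            print(company_names[query_id],
                  company_names[hit["corpus_id"]],
                  hit["score"])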
Ah yes, that is a simple way to do it. Thank you.
I had the same observation: the current code would still crash because of OOM. I managed to fix it by adding the following lines at each iteration at https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L188:
del embeddings
torch.cuda.empty_cache()
I can confirm this is still an issue.
Upon inspecting the source code I initially thought that maybe this block was the culprit:
features = batch_to_device(features, device)
since nowhere in the code do the features seem to be cleared.
The PR above may indeed fix it: https://github.com/UKPLab/sentence-transformers/pull/1717/files
I followed liaocs2008's advice; adding this block to the code seems to have improved the memory footprint:
del features
del embeddings
torch.cuda.empty_cache()
However, I still kept getting memory leaks, possibly for another reason.
It seems that after training/evaluation, if I delete the model, no memory is freed, so if I load a second model I get an OOM error. This was a problem for me because I'm running a grid search over the hyperparameters of the SentenceTransformer.
I eventually solved the problem with a workaround: launching the training loop inside a separate process and terminating the process each time I switch the model configuration. This guarantees that I always start a new training session with clear memory. Not exactly clean, but it works quite well :)
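A rough sketch of that process-per-configuration workaround; train_one_config and the config contents are placeholders for the actual grid-search code:

import torch.multiprocessing as mp

def train_one_config(config):
    # Everything CUDA-related lives and dies inside this function, so the
    # parent process never accumulates GPU allocations across configurations.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(config["model_name"])
    # ... fit / evaluate here and write the results to disk ...

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # fresh CUDA context per run
    for config in [{"model_name": "distilbert-base-nli-stsb-mean-tokens"}]:
        p = mp.Process(target=train_one_config, args=(config,))
        p.start()
        p.join()  # once the process exits, its GPU memory is released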
Can also confirm that this is still a problem.
I am both training a model and generating predictions with it inside a for loop (in an active learning setup), and I encountered the same issue as described above. Forcing garbage collection and calling torch.cuda.empty_cache() after both the training and prediction steps seems to fix at least part of the problem (not sure if there is more) and my code succeeds again:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
However, this solution is ugly and a fix in sentence-transformers would be a better option.
The problem described is not fixed by setting convert_to_numpy to True (#525).
Edit: I drew this conclusion too early and am now doubting this fix.
Hi all,
It's my first time posting in a thread like this, so forgive me for my untidiness.
The fix that I got to work uses Python's multiprocessing Pool. The idea is that when a worker dies, the memory it held is released along with it once the value has been returned.
from multiprocessing import Pool

@ROUTER.post("/inference")
def make_inference(sentences: Sentence):
    # predict() runs in a short-lived worker process, so the GPU memory it
    # uses is released when the worker exits.
    with Pool(2) as p:
        results = p.map(predict, [sentences])
    return {"message": results}
For each of my tasks it takes up 5 GB of VRAM, and that is consistent as we increase the load on the API. However, with this fix, the VRAM drops from 5 GB back down to 1 GB for each request.
I presume this is a temporary fix and hope that a more sustainable one will come.
Please give me a reaction if it helps!! :D
Hey,
I have been using this repository to obtain sentence embeddings for a data set I am currently working on. When using SentenceTransformer.encode, I noticed that my VRAM usage grows over time until a CUDA out of memory error is raised. Through my own experiments I have found the following: moving the embeddings to the CPU before they are extended to all_embeddings, using embeddings = embeddings.to("cpu"), greatly reduces this growth, and after additionally calling torch.cuda.empty_cache(), VRAM usage appears to stop growing over time. The first point makes sense as a fix, but I am unsure why the second call is necessary? I am using: pytorch 1.6.0, transformers 3.3.1, sentence_transformers 0.3.7.
Have I missed something in the docs, or am I doing something daft? I am happy to submit a pull request if need be.
Thanks,
Martin