UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

CUDA out of memory #723

Open minyoung90 opened 3 years ago

minyoung90 commented 3 years ago

I found that when I interrupted training (e.g. with Ctrl+Z), the GPU memory was not cleared, so whenever I restarted training I got 'CUDA out of memory'. But when an exception occurred in the source code, the memory was cleared. In this case, do I have to free the memory by hand? (I killed the python process.)
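(For context: Ctrl+Z only suspends a process via SIGTSTP, it does not terminate it, so a suspended trainer keeps holding its CUDA allocations. If the Python session is still alive, a minimal cleanup sketch like the following can release memory by hand; the model name is just an example and the `del` target stands for whatever objects your training loop still references.)

```python
import gc
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

# ... training happens here ...

# Drop every live reference first; cached CUDA blocks cannot be
# released while Python objects still point at them.
del model
gc.collect()

# Return cached-but-unused blocks to the CUDA driver.
torch.cuda.empty_cache()

# Verify what the current process still holds.
print(f"{torch.cuda.memory_allocated() / 1024**2:.1f} MiB allocated")
print(f"{torch.cuda.memory_reserved() / 1024**2:.1f} MiB reserved")
```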

One more question: I found that with the same batch size (e.g. 64), turning off mixed precision (use_amp=False) consumes more memory. That is expected, right?

nreimers commented 3 years ago

Not sure why CUDA does not release the memory. For me it works: when I kill a process, CUDA performs garbage collection and frees the memory. This appears to be an issue with your setup / CUDA installation.

Regarding the question: this is expected. AMP runs much of the computation in float16 instead of float32, so the corresponding tensors require only about half the memory.
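(For illustration, a minimal training sketch with the `fit` API; the model name and toy data are placeholders:)

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [
    InputExample(texts=["A sentence", "A very similar sentence"], label=0.9),
    InputExample(texts=["A sentence", "An unrelated sentence"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.CosineSimilarityLoss(model)

# use_amp=True runs the forward pass under float16 autocasting, which
# roughly halves activation memory at the same batch size vs. use_amp=False.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    use_amp=True,
)
```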

minyoung90 commented 3 years ago

Thanks for the quick answer! One more piece of information: I ran my code in a Docker container.

tide90 commented 3 years ago

@minyoung90 Do you mean when you are using the GPU locally? What GPU do you have? I experience similar things with rather old GPUs.

minyoung90 commented 3 years ago

@tide90 I used an AWS EC2 instance (g4dn.xlarge), which has a T4 GPU.

RGump commented 3 years ago

This is relevant for CPU as well: after model.encode(query), the memory was not cleared.
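(On CPU, the only handle is Python's garbage collector; a minimal sketch of what can be tried, assuming the example model name. Note the allocator may retain freed pages, so the process RSS does not necessarily shrink.)

```python
import gc
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embeddings = model.encode(["some query text"])

# Drop the references so the buffers become collectable.
del embeddings, model
gc.collect()
```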

sonic182 commented 4 weeks ago

On AWS SageMaker:

It once happened to me. My mistake was installing a newer PyTorch version inside SageMaker containers that already ship with a version that fully works.

Better not to add "torch" as a dependency there.
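(As a quick sanity check before installing anything on top of a SageMaker image, one can confirm which torch build is active; a mismatched pip install often shows up as CUDA becoming unavailable.)

```python
import torch

# The version string shows whether the preinstalled, CUDA-enabled build
# is still active or has been shadowed by a later pip install.
print(torch.__version__)

# False here, on a GPU instance, usually means a CPU-only or mismatched
# torch was installed over the container's original build.
print(torch.cuda.is_available())
```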