I hit a CUDA out-of-memory (OOM) error while training dino_swin_large_384_4scale_36ep, after roughly 50% of the total epochs had completed. The epoch at which the error occurred was different in each experiment.
I am using the nvcr.io/nvidia/pytorch:23.01-py3 container, which has the following specs:
NVIDIA CUDA : 12.0.1
cuDNN : 8.7.0
pytorch : 1.14.0
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.04 GiB (GPU 0; 79.35 GiB total capacity; 69.64 GiB already allocated; 879.56 MiB free; 77.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
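As a quick sanity check on the numbers in that message, here is a small sketch that parses it and computes how much memory PyTorch had reserved but not allocated to tensors. The helper names (`parse_oom`, `to_gib`) are my own and not part of PyTorch; the only input is the message text copied from above.

```python
import re

# Error text copied from the traceback above (trimmed to the numeric part).
MESSAGE = (
    "CUDA out of memory. Tried to allocate 1.04 GiB (GPU 0; 79.35 GiB total capacity; "
    "69.64 GiB already allocated; 879.56 MiB free; 77.09 GiB reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value, unit):
    """Convert a (number, unit) pair from the message into GiB."""
    return float(value) * UNIT_TO_GIB[unit]


def parse_oom(message):
    """Hypothetical helper: pull the five sizes out of the OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {
        name: to_gib(*re.search(pattern, message).groups())
        for name, pattern in patterns.items()
    }


stats = parse_oom(MESSAGE)
# Memory held by the caching allocator but not backing any live tensor.
cached_unused = stats["reserved"] - stats["allocated"]
print(f"requested:                  {stats['requested']:.2f} GiB")
print(f"reserved but not allocated: {cached_unused:.2f} GiB")
```

The gap comes out at about 7.45 GiB, while the failed request was only 1.04 GiB. That seems consistent with the fragmentation hint in the message (no single free block large enough), although it does not prove it.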
I suspect the error is caused by garbage (memory that is never released) accumulating in the process over time until the GPU runs out of memory. However, I'm not sure why it only occurs after half of the training has completed (sometimes after 90%).
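To test that suspicion, I could log the allocator statistics at the end of every epoch and see whether they grow steadily. This is only a minimal sketch: `memory_snapshot` and `log_memory` are hypothetical helper names of mine, and where to call them inside the training loop is an assumption. The `torch.cuda` calls themselves are standard PyTorch functions.

```python
try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def memory_snapshot(device=0):
    """Return allocated/reserved/peak memory in GiB, or None if CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    gib = 1024 ** 3
    return {
        "allocated": torch.cuda.memory_allocated(device) / gib,
        "reserved": torch.cuda.memory_reserved(device) / gib,
        "peak_allocated": torch.cuda.max_memory_allocated(device) / gib,
    }


def log_memory(tag, device=0):
    """Print one line of memory stats; intended to be called once per epoch."""
    snap = memory_snapshot(device)
    if snap is None:
        print(f"[{tag}] CUDA not available")
    else:
        details = ", ".join(f"{key}={value:.2f} GiB" for key, value in snap.items())
        print(f"[{tag}] {details}")
    return snap


# Example call; in training this would go at the end of each epoch.
log_memory("epoch_end")
```

If `allocated` climbs from epoch to epoch, something is probably holding references to tensors. If `allocated` stays flat while `reserved` keeps growing, fragmentation in the caching allocator looks like the more likely explanation.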
Can anyone suggest a possible cause for this issue and any potential solutions to prevent it from happening in the future?
Thank you in advance for your help.