WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

CUDA out of memory during training #2005

Closed: shubzk closed this issue 6 months ago

shubzk commented 6 months ago

I am training the yolov7 model on a custom dataset on an Azure ML VM with 2 NVIDIA V100 GPUs, using the following command:

python train.py --img 3072 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1

However, at around 29% of the first epoch, training fails with the following CUDA out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 264.00 MiB. GPU 0 has a total capacity of 15.77 GiB of which 173.12 MiB is free. Including non-PyTorch memory, this process has 15.60 GiB memory in use. Of the allocated memory 14.43 GiB is allocated by PyTorch, and 738.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Please help.
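As a first step, the traceback's own suggestion can be tried: setting PYTORCH_CUDA_ALLOC_CONF before PyTorch initializes CUDA. Below is a minimal sketch of a hypothetical wrapper script (not part of this repo; the variable can equally be exported in the shell before launching train.py):

```python
# Hypothetical wrapper that applies the allocator setting the traceback
# suggests. The variable must be set before the first CUDA allocation,
# so it is exported before torch is imported.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var so the caching allocator sees it
print(torch.cuda.is_available())  # sanity check before kicking off training
```

Note this only mitigates fragmentation (the 738.57 MiB reserved-but-unallocated portion in the traceback); it will not help if the working set at --img 3072 genuinely exceeds the V100's 16 GB.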

dsbyprateekg commented 6 months ago

@shubzk reduce the image size from 3072 to 1280.
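For clarity, that amounts to rerunning the original command with only the image size changed:

python train.py --img 1280 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1

Activation memory grows roughly with the square of the input side length, so dropping from 3072 to 1280 cuts the per-image footprint by a factor of about (3072/1280)^2 ≈ 5.8.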

shubzk commented 6 months ago

@dsbyprateekg Thank you. What I have ended up doing instead is increasing my compute power by moving to an A100, since the images I am using become unusable below 3072.
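As a side note, whether the new GPU has enough headroom at --img 3072 can be checked up front with PyTorch's memory-info API; a minimal sketch, independent of this repo's code:

```python
import torch

# Print free vs. total memory for every visible GPU, to verify headroom
# at the target resolution before committing to a long training run.
for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)  # returns bytes
    print(f"GPU {i} ({torch.cuda.get_device_name(i)}): "
          f"{free_b / 1e9:.1f} GB free of {total_b / 1e9:.1f} GB total")
```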