WongKinYiu / yolov9

Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
GNU General Public License v3.0
8.99k stars 1.42k forks

CUDA OUT OF MEMORY #281

Open MuhammadBilal848 opened 8 months ago

MuhammadBilal848 commented 8 months ago

I have set everything up for custom training and am using this command to train the model (I am running this on my laptop):

python train_dual.py --workers 8 --device 0 --batch 8 --data 'LP/data.yaml' --img 640 --cfg models/detect/yolov9-e.yaml --weights 'yolov9-e.pt' --name yolov9-e-finetuning --hyp hyp.scratch-high.yaml --min-items 0 --epochs 10 --close-mosaic 15

Getting this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 2.62 GiB is free. Of the allocated memory 2.24 GiB is allocated by PyTorch, and 78.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Here's my GPU specs:

[image: GPU specs screenshot]

Youho99 commented 7 months ago

Reduce your batch size.

Also try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

MuhammadBilal848 commented 7 months ago

I'm running training from the command line. How can I use PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True with python train_dual.py --workers 8 --device 0 --batch 8 --data 'LP/data.yaml' --img 640 --cfg models/detect/yolov9-e.yaml --weights 'yolov9-e.pt' --name yolov9-e-finetuning --hyp hyp.scratch-high.yaml --min-items 0 --epochs 10 --close-mosaic 15?

Youho99 commented 7 months ago

You need to set it as an environment variable before running your command.

But also try changing only your batch size.
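
For anyone unsure how to do this, here is a minimal sketch in Python that launches the training command from this thread in a child process with the variable set. The helper names (build_env, build_command, launch) are hypothetical and exist only for this illustration, and the batch size of 4 is an arbitrary lower starting point, not a tested value. In a POSIX shell the same effect comes from running export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before the command; in Windows cmd, use set instead of export.

```python
import os
import subprocess
import sys

ALLOC_VAR = "PYTORCH_CUDA_ALLOC_CONF"


def build_env():
    """Copy the current environment and add the allocator setting."""
    env = os.environ.copy()
    env[ALLOC_VAR] = "expandable_segments:True"
    return env


def build_command(batch):
    """Training command from this thread, with an adjustable batch size."""
    return [
        sys.executable, "train_dual.py",
        "--workers", "8", "--device", "0",
        "--batch", str(batch),
        "--data", "LP/data.yaml", "--img", "640",
        "--cfg", "models/detect/yolov9-e.yaml",
        "--weights", "yolov9-e.pt",
        "--name", "yolov9-e-finetuning",
        "--hyp", "hyp.scratch-high.yaml",
        "--min-items", "0", "--epochs", "10", "--close-mosaic", "15",
    ]


def launch(batch=4):
    """Run training in a child process that sees the environment variable.

    Assumes the current directory is the root of the yolov9 repository.
    """
    return subprocess.run(build_command(batch), env=build_env(), check=True)
```

Calling launch(batch=4) from the repository root starts training. The variable has to be in place before PyTorch initializes CUDA, which is why it is passed to the child process rather than set partway through the training script.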

kuacboss commented 7 months ago

Thank you for the good answer. I am also experiencing the same problem. YOLOv9 has fewer FLOPs and parameters than YOLOv8-x, so why does YOLOv8 run while YOLOv9 fails with an out-of-memory error?

shubzk commented 7 months ago

Three things you can try to get you started:

1) Reduce batch size
2) Reduce dataset size
3) In train.py, after line 479 ("del ckpt"), add the following two lines:

torch.cuda.empty_cache()
gc.collect()

Remember to import gc at the top of the file.
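
Suggestion 3 can be sketched as a small helper. This is an illustration under assumptions, not the repository's code: the name release_memory is hypothetical, and the line number 479 is taken from the comment above, not checked against the current train.py. The sketch runs gc.collect() first so that unreachable tensors are destroyed before the cached blocks are released. Note that torch.cuda.empty_cache() only returns unused cached memory to the driver; memory still held by live tensors is not freed.

```python
import gc

try:
    import torch
except ImportError:  # allows the sketch to run on a machine without PyTorch
    torch = None


def release_memory():
    """Collect unreachable Python objects, then release cached CUDA blocks.

    Returns the number of unreachable objects found by the collector.
    """
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Hands unused cached blocks back to the driver; tensors that are
        # still referenced keep their memory.
        torch.cuda.empty_cache()
    return collected


# Stand-in for the checkpoint dict that train.py deletes with "del ckpt".
ckpt = {"model": [0.0] * 1000}
del ckpt
freed_objects = release_memory()
```

Whether this is enough to avoid the error depends on how much memory the model and batch need during training, so it is best combined with a smaller batch size.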

MyGitHub-G commented 1 month ago

I am also experiencing the same problem. Have you solved the problem?