WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

RuntimeError: CUDA out of memory #442

Open · KendoClaw1 opened 2 years ago

KendoClaw1 commented 2 years ago

I am trying to train a yolov7-tiny model on a custom dataset. I am training on Kaggle, which offers a free GPU, but PyTorch allocates more than 90% of the available memory, which causes training to fail. I tried training on my local machine and got the same error, and I tried reducing the image size, the number of workers, and the batch size, still with the same result. I have no problems training YOLOv5 with the same exact setup.

My training command: !python train.py --workers 4 --device 0 --batch-size 16 --data /kaggle/working/dataset/config/custom.yaml --img 640 --cfg /kaggle/working/dataset/config/yolov7-custom.yaml --weights 'yolov7-tiny.pt' --name yolov7

Why does PyTorch allocate most of the GPU memory?
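
For what it's worth, much of what nvidia-smi reports is PyTorch's caching allocator holding freed blocks for reuse, not live tensors. A small sketch showing the difference (the numbers printed will vary by setup):

```python
import torch

# PyTorch's caching allocator keeps freed blocks reserved for reuse instead
# of returning them to the driver, so nvidia-smi shows high usage even when
# few tensors are actually alive.
x = torch.randn(4096, 4096, device="cuda")  # a 64 MiB float32 tensor
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated (live tensors)")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved (allocator cache)")

del x
torch.cuda.empty_cache()  # hand cached blocks back to the driver
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved after empty_cache()")
```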

Error logs:

Traceback (most recent call last):
  File "train.py", line 610, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 361, in train
    pred = model(imgs)  # forward
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/kaggle/working/dataset/yolov7/models/yolo.py", line 587, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/kaggle/working/dataset/yolov7/models/yolo.py", line 613, in forward_once
    x = m(x)  # run
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/kaggle/working/dataset/yolov7/models/common.py", line 108, in forward
    return self.act(self.bn(self.conv(x)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 394, in forward
    return F.silu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2032, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA out of memory. Tried to allocate 1.56 GiB (GPU 0; 15.90 GiB total capacity; 13.66 GiB already allocated; 236.75 MiB free; 14.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
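
As the error message itself suggests, reserved memory far above allocated memory points to allocator fragmentation, which the max_split_size_mb knob can mitigate. A minimal sketch of trying it (the value 128 is an illustrative starting point, not a tuned recommendation; the variable must be set before the first CUDA allocation):

```python
import os

# Must be set before torch initializes its CUDA caching allocator.
# 128 MiB is an arbitrary starting value to experiment with.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the allocator sees it
```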

wangzhao-11a commented 2 years ago

Decrease the batch size.

knakanishi24 commented 2 years ago

I also met the same problem: I get the same situation after some epochs. Not a solution, but as a workaround I train the first time with --save_period 10, and after an error occurs I restart with --resume --save_period 10. Then it is possible to continue. A generic sketch of the idea is below.
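
The idea behind that workaround, sketched generically (model, optimizer, epoch, and the file name are placeholders, not yolov7's actual training loop; yolov7's train.py handles this through its own --save_period and --resume flags):

```python
import torch

# Periodic checkpointing: persist state every N epochs so an OOM crash only
# costs the epochs since the last save.
def save_checkpoint(model, optimizer, epoch, path="last.pt"):
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="last.pt"):
    ckpt = torch.load(path, map_location="cuda")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1  # epoch to resume from
```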

senstar-hsoleimani commented 2 years ago

Same problem here! I changed the batch size to 1 and reduced the image dimensions and the number of workers, and still the issue is there. The GPU memory usage changes from iteration to iteration! I played with the PYTORCH_CUDA_ALLOC_CONF variable too, but the issue did not go away. I also realized that this happens when the number of classes is high (for example, over 20 classes). I tested it with classNum=3 and it worked like a charm.
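
One way to pin down that iteration-to-iteration variation is to log the per-step peak allocation with PyTorch's built-in counters. A minimal sketch (where and how often it is called inside the training loop is up to you; the step argument is hypothetical):

```python
import torch

def log_peak_memory(step: int) -> None:
    # Peak bytes allocated since the last reset; resetting every step makes
    # per-iteration spikes visible (e.g. batches with many target boxes, or
    # loss terms that scale with the number of classes).
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    print(f"step {step}: peak {peak_mib:.0f} MiB")
    torch.cuda.reset_peak_memory_stats()
```

Called once per optimizer step, this shows whether the spikes correlate with particular batches or with the class count.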