KendoClaw1 opened this issue 2 years ago
Decrease the batch size.
I also met the same problem; I get the same situation after some epochs. Not a solution, but a workaround: start the first training run with --save_period 10, and after the error occurs, restart with --resume --save_period 10. Then it is possible to continue.
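As commands, that workaround looks roughly like this (a sketch only; the dataset/config paths and run name are placeholders for your own files, while --save_period and --resume are the train.py flags mentioned above):

```bash
# First run: save a checkpoint every 10 epochs so there is something to resume from
python train.py --workers 4 --device 0 --batch-size 16 --img 640 \
    --data config/custom.yaml --cfg cfg/training/yolov7-tiny-custom.yaml \
    --weights yolov7-tiny.pt --name yolov7-custom --save_period 10

# After the OOM crash: resume the most recent run and keep saving periodically
python train.py --resume --save_period 10
```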
Same problem here! I changed the batch size to 1 and reduced the image dimensions and the number of workers, but the issue is still there. The GPU memory usage changes from iteration to iteration! I played with the PYTORCH_CUDA_ALLOC_CONF variable too, but the issue did not go away. I also noticed that this happens when the number of classes is high (for example, over 20 classes). I tested it with classNum=3 and it worked like a charm.
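For anyone who wants to try the same thing, the variable can be set directly on the training command (a sketch; max_split_size_mb:128 is an arbitrary example value, the paths are taken from the command in the issue below, and in a notebook cell the line would be prefixed with ! as in that command):

```bash
# max_split_size_mb caps how large a cached block the CUDA caching allocator will split,
# which can help when reserved memory is much larger than allocated memory (fragmentation)
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py --workers 4 --device 0 \
    --batch-size 16 --img 640 \
    --data /kaggle/working/dataset/config/custom.yaml \
    --cfg /kaggle/working/dataset/config/yolov7-custom.yaml \
    --weights 'yolov7-tiny.pt' --name yolov7
```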
I am trying to train a yolov7-tiny model on a custom dataset. I am training on Kaggle, which offers a free GPU, and PyTorch allocates more than 90% of the available memory, which makes training fail. I tried to train on my local machine and got the same error. I tried reducing the image size, the number of workers, and the batch size, and still got the same result. I have no problems training with yolov5 using the exact same setup.
My training command: !python train.py --workers 4 --device 0 --batch-size 16 --data /kaggle/working/dataset/config/custom.yaml --img 640 --cfg /kaggle/working/dataset/config/yolov7-custom.yaml --weights 'yolov7-tiny.pt' --name yolov7
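For reference, the memory-reduction knobs mentioned above map onto this command as the --batch-size, --img, and --workers flags, e.g. (the specific values are arbitrary examples; as reported above, lowering them did not resolve the error on its own):

```bash
# Smaller batch, smaller input resolution (kept a multiple of 32), fewer dataloader workers
python train.py --batch-size 4 --img 480 --workers 2 --device 0 \
    --data /kaggle/working/dataset/config/custom.yaml \
    --cfg /kaggle/working/dataset/config/yolov7-custom.yaml \
    --weights 'yolov7-tiny.pt' --name yolov7
```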
Why does PyTorch allocate most of the GPU memory?
Error logs:
Traceback (most recent call last):
  File "train.py", line 610, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 361, in train
    pred = model(imgs)  # forward
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/kaggle/working/dataset/yolov7/models/yolo.py", line 587, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/kaggle/working/dataset/yolov7/models/yolo.py", line 613, in forward_once
    x = m(x)  # run
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/kaggle/working/dataset/yolov7/models/common.py", line 108, in forward
    return self.act(self.bn(self.conv(x)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 394, in forward
    return F.silu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2032, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA out of memory. Tried to allocate 1.56 GiB (GPU 0; 15.90 GiB total capacity; 13.66 GiB already allocated; 236.75 MiB free; 14.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF