Problem with allocating memory (CUDA out of memory)

Einhartd commented 7 months ago

Dist: Pop OS 22.04

nvidia-smi output:

Tue Nov 28 23:18:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P8              10W /  60W |      9MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2686      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I encountered an error while trying to train my first model with YOLO3D. I simply followed the instructions in docs/mono3d.md. After I entered this command:

./launchers/train.sh config/CONFIG_FILE_YOLO.py 0 proba

the program crashed with the error below:

Traceback (most recent call last):
  File "/home/einhart/visualDet3D/scripts/train.py", line 199, in <module>
    Fire(main)
  File "/home/einhart/anaconda3/envs/visualDet3D/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/einhart/anaconda3/envs/visualDet3D/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/einhart/anaconda3/envs/visualDet3D/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/einhart/visualDet3D/scripts/train.py", line 150, in main
    training_dection(data, detector, optimizer, writer, training_loss_logger, global_step, epoch_num, cfg)
  File "/home/einhart/visualDet3D/visualDet3D/networks/pipelines/trainers.py", line 35, in train_mono_detection
    classification_loss, regression_loss, loss_dict = module(
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/detectors/yolomono3d_detector.py", line 126, in forward
    return self.training_forward(img_batch, annotations, calib)
  File "/home/einhart/visualDet3D/visualDet3D/networks/detectors/yolomono3d_detector.py", line 91, in training_forward
    features  = self.core(dict(image=img_batch, P2=P2))
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/detectors/yolomono3d_core.py", line 16, in forward
    x = self.backbone(x['image'])
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/backbones/resnet.py", line 195, in forward
    x = layer(x)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/backbones/resnet.py", line 82, in forward
    out = self.conv3(out)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacty of 3.81 GiB of which 6.31 MiB is free. Including non-PyTorch memory, this process has 3.79 GiB memory in use. Of the allocated memory 3.60 GiB is allocated by PyTorch, and 89.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
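
For readers following along, here is a rough reading of the numbers in that message, as a minimal sketch. Every figure is copied from the error text above; the variable names are only illustrative.

```python
# Figures copied from the error message above (units as PyTorch reports them).
MIB_PER_GIB = 1024

total_mib = 3.81 * MIB_PER_GIB       # total capacity of GPU 0
free_mib = 6.31                      # free memory when the allocation failed
requested_mib = 90.00                # size of the allocation that failed
allocated_mib = 3.60 * MIB_PER_GIB   # memory held by live PyTorch tensors
reserved_unallocated_mib = 89.33     # cached by PyTorch but not in use

# How far short of the request the free memory was.
shortfall_mib = requested_mib - free_mib

# Share of the card taken by live tensors versus idle cache.
allocated_share = allocated_mib / total_mib
cached_share = reserved_unallocated_mib / total_mib

print(f"shortfall: {shortfall_mib:.2f} MiB")
print(f"live tensors: {allocated_share:.1%} of the card")
print(f"cached but unused: {cached_share:.1%} of the card")
```

The idle cache is small next to the memory held by live tensors, and it is below the 90 MiB requested. That suggests the max_split_size_mb hint is unlikely to be enough on its own here, and that the model and batch themselves need to get smaller.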

I have tried reducing the batch_size value from 8 to 2 in CONFIG_FILE_YOLO.py, which was generated with this command: cp Yolo3D_example $CONFIG_FILE.py

Do you have any idea how to reduce the allocated memory?

Owen-Liuyuxuan commented 7 months ago

I have not tried training the network with 4 GB of memory. You could try changing the backbone to ResNet50, or reducing the batch size further to 1 (you may need to tune the learning rate after this).

It will be difficult to reproduce the full results, though.
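
The two suggestions above could be sketched roughly as follows. This is not the actual layout of CONFIG_FILE_YOLO.py: the field names (batch_size, backbone_depth, lr) and the starting depth and learning rate are assumptions for illustration, so check your own copy of the config for the real keys and values. Only the original batch size of 8 comes from this thread. The linear learning-rate scaling is a common heuristic, not something the repository prescribes, and is meant only as a starting point for tuning.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the relevant values in CONFIG_FILE_YOLO.py.
# batch_size=8 is from this thread; backbone_depth and lr are assumed values.
cfg = SimpleNamespace(batch_size=8, backbone_depth=101, lr=1e-4)


def shrink_for_small_gpu(cfg, new_batch_size=1, new_depth=50):
    """Return a new config with a smaller batch, a shallower backbone,
    and a learning rate rescaled in proportion to the batch size."""
    scale = new_batch_size / cfg.batch_size
    return SimpleNamespace(
        batch_size=new_batch_size,
        backbone_depth=new_depth,
        lr=cfg.lr * scale,  # linear scaling heuristic, a starting point only
    )


small = shrink_for_small_gpu(cfg)
print(small)
```

The helper returns a new object instead of editing cfg in place, so the original values stay available for comparison while tuning.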

Einhartd commented 7 months ago

It worked! Thank you so much!

Owen-Liuyuxuan / visualDet3D, issue #85