DrSleep / light-weight-refinenet

Light-Weight RefineNet for Real-Time Semantic Segmentation

About RuntimeError: CUDA out of memory #41

Closed Txrachel closed 5 years ago

Txrachel commented 5 years ago

Hi, thanks for your wonderful work and detailed tutorial. I am new to this. When I try to retrain the model, I get a RuntimeError. I then set BATCH_SIZE (in config.py) to [1] * 3, but it still fails. Have you ever encountered this problem? Could you please help me? Thanks in advance!
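For reference, the batch-size change described above would look roughly like this in config.py. This is a sketch based on the description in this thread, assuming BATCH_SIZE is a list with one entry per training stage (the repo trains in three stages):

```python
# config.py (excerpt, hypothetical layout)
# BATCH_SIZE holds one value per training stage; setting every entry to 1
# minimises per-step GPU memory, though as the traceback below shows it
# may still not fit on a ~2 GiB card.
BATCH_SIZE = [1] * 3  # i.e. [1, 1, 1]
```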

INFO:__main__: Train epoch: 0 [0/795] Avg. Loss: 3.751 Avg. Time: 1.046

Traceback (most recent call last):
  File "src/train.py", line 425, in <module>
    main()
  File "src/train.py", line 409, in main
    args.freeze_bn[task_idx])
  File "src/train.py", line 273, in train_segmenter
    output = segmenter(input_var)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/SS/light-weight-refinenet/models/resnet.py", line 237, in forward
    x1 = self.mflow_conv_g4_pool(x1)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/SS/light-weight-refinenet/utils/layer_factory.py", line 72, in forward
    top = self.maxpool(top)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/pooling.py", line 146, in forward
    self.return_indices)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/_jit_internal.py", line 133, in fn
    return if_false(*args, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/functional.py", line 494, in _max_pool2d
    input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 1.96 GiB total capacity; 1.14 GiB already allocated; 20.06 MiB free; 41.52 MiB cached)

arindamrc commented 5 years ago

I'm observing similar behaviour as well. I can train only with a batch size of 1, yet the GPU memory isn't fully utilized either. I'm training on a GTX 1080 with 8 GB of VRAM.

DrSleep commented 5 years ago

Can't help with this one; I would suggest making sure that no other GPU processes are running alongside. I think a 1080 should be enough with a batch size of 1. For reference, I am using a 1080Ti with a batch size of 6.

rfairhurst commented 5 years ago

I realize you can't help, but I am also getting this error. I am using an Nvidia Quadro P4000 with 8 GB of VRAM.

The Task Manager shows very low GPU memory usage until the program prints:

Train epoch: 0 [0/132] Avg. Loss: 3.711 Avg. Time: 2.425

Then the GPU memory usage jumps in under a second to over 90% and throws the error:

  File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 425, in <module>
    main()
  File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 409, in main
    args.freeze_bn[task_idx])
  File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 280, in train_segmenter
    loss.backward()
  File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\rfairhur\AppData\Local\Programs\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 230.00 MiB (GPU 0; 8.00 GiB total capacity; 5.81 GiB already allocated; 159.27 MiB free; 333.44 MiB cached)

I believe my batch size is set to 1. In any case, I will search for this error to see if there is anything else I can try.

rfairhurst commented 5 years ago

Apparently I was wrong about my batch-size setting; it must have been set to 6 or higher. Once I made sure it was 5 or less, training ran successfully, but it failed again when I set the batch size to 6.
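For anyone who lands here with the same problem: if lowering the batch size lets training run but you are worried about the smaller effective batch, gradient accumulation can emulate a larger batch within the same memory budget. This is a generic PyTorch sketch, not part of this repository; the tiny linear model, sizes, and variable names are placeholders:

```python
import torch
import torch.nn as nn

# Gradient accumulation: emulate an effective batch of 6 on a GPU that only
# fits micro-batches of 2, by accumulating gradients over 3 forward/backward
# passes before each optimiser step.
model = nn.Linear(8, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accum_steps = 3                     # 3 micro-batches of 2 -> effective batch of 6
data = torch.randn(6, 8)            # placeholder inputs
labels = torch.randint(0, 4, (6,))  # placeholder targets

optimizer.zero_grad()
for i in range(accum_steps):
    x = data[i * 2:(i + 1) * 2]
    y = labels[i * 2:(i + 1) * 2]
    loss = criterion(model(x), y) / accum_steps  # average over micro-batches
    loss.backward()                              # gradients accumulate in .grad
optimizer.step()
```

The trade-off: per-step memory scales with the micro-batch, not the effective batch, at the cost of extra forward/backward passes per update. Note that batch-norm statistics still see only the micro-batch, which matters for this model unless batch norm is frozen.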