Txrachel closed this issue 5 years ago.
I'm observing similar behaviour as well. I can only train with a batch size of 1, and the GPU memory isn't fully utilized either. I'm training on a GTX 1080 with 8 GB of VRAM.
Can't help with this one; I would suggest making sure that no other GPU processes are running alongside. With a batch size of 1, a 1080 should be enough. For reference, I am using a 1080 Ti with a batch size of 6.
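If it helps with ruling that out, a quick way to see how much of the card is already taken before training starts is to query it from PyTorch. A minimal sketch (device index 0 assumed; PyTorch only reports its own allocations, so other processes are easier to spot with nvidia-smi):

import torch

# Report total VRAM and what this process has already allocated on GPU 0.
# If most of the card is held elsewhere, even a batch size of 1 may not fit.
device = torch.device('cuda:0')
props = torch.cuda.get_device_properties(device)
total_gib = props.total_memory / 1024 ** 3
allocated_gib = torch.cuda.memory_allocated(device) / 1024 ** 3
print('GPU: {} | total: {:.2f} GiB | allocated by this process: {:.2f} GiB'.format(
    props.name, total_gib, allocated_gib))
# Memory held by other processes will not show up here; check nvidia-smi for that.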
I realize you can't help, but I am also getting this error. I am using an Nvidia Quadro P4000 with 8 GB of VRAM.
The Task Manager shows very low GPU memory usage until the program prints:
Train epoch: 0 [0/132] Avg. Loss: 3.711 Avg. Time: 2.425
Then the GPU memory usage jumps to over 90% in under a second and the program throws the error:
File "C:\Users\rfairhur\Documents\Jupyter Notebooks\light-weight-refinenet-master\src\train.py", line 425, in
I believe my batch size is set to 1. Anyway, I will search Google for this error to see if there is anything I can try.
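If anyone wants to pin down where that sudden jump happens, printing the peak allocation right after the first step works. A rough sketch, reusing the segmenter/input_var names from the traceback in this thread; criterion, target_var and optimizer are placeholders for whatever train.py actually uses:

import torch

# Drop this around the first training step to see where the peak lands
# (backward() is usually where the allocation spikes).
output = segmenter(input_var)              # forward pass
loss = criterion(output, target_var)       # criterion / target_var: placeholders
loss.backward()
optimizer.step()
print('peak GPU memory so far: {:.0f} MiB'.format(
    torch.cuda.max_memory_allocated() / 1024 ** 2))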
Apparently I was wrong about my batch size setting. It must have been set to 6 or higher, because when I made sure it was set to 5 or less the training ran successfully, but it failed if I set the batch size to 6.
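In case the concrete change is useful to anyone: for me it came down to the per-stage batch size list in config.py. A sketch of the edit, assuming the BATCH_SIZE name and the three-stage list format mentioned in the next comment; check your own config.py for the exact variable:

# config.py (excerpt, names assumed)
# BATCH_SIZE = [6] * 3   # one entry per training stage; 6 ran out of memory on 8 GB
BATCH_SIZE = [5] * 3     # 5 or lower trained fine on my card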
Hi, thanks for your wonderful work and detailed tutorial. I am brand new here; when I try to retrain the model, I get a RuntimeError. I then set the batch size in config.py to [1] * 3, but it still doesn't work. Have you ever run into this problem? Could you please help me? Thanks in advance!
INFO:__main__: Train epoch: 0 [0/795]  Avg. Loss: 3.751  Avg. Time: 1.046
Traceback (most recent call last):
  File "src/train.py", line 425, in <module>
    main()
  File "src/train.py", line 409, in main
    args.freeze_bn[task_idx])
  File "src/train.py", line 273, in train_segmenter
    output = segmenter(input_var)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/SS/light-weight-refinenet/models/resnet.py", line 237, in forward
    x1 = self.mflow_conv_g4_pool(x1)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/SS/light-weight-refinenet/utils/layer_factory.py", line 72, in forward
    top = self.maxpool(top)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/modules/pooling.py", line 146, in forward
    self.return_indices)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/_jit_internal.py", line 133, in fn
    return if_false(*args, **kwargs)
  File "/home/txr/.virtualenvs/env_test/lib/python3.5/site-packages/torch/nn/functional.py", line 494, in _max_pool2d
    input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 1.96 GiB total capacity; 1.14 GiB already allocated; 20.06 MiB free; 41.52 MiB cached)
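For anyone else hitting this exact traceback: the message itself reports only 1.96 GiB of total capacity on GPU 0, so even [1] * 3 can run out of memory at full resolution. Beyond the batch size, the usual levers are a smaller training crop, keeping validation out of autograd, and releasing cached blocks between stages. A generic sketch of those ideas (CROP_SIZE is a placeholder name, not the repo's actual variable):

import torch

# 1. Shrink the input crop as well as the batch size; activation memory in the
#    decoder grows with crop area. (CROP_SIZE is a placeholder name.)
CROP_SIZE = 300   # e.g. down from 500

# 2. Keep validation outside autograd so no graph or activations are retained.
def validate(segmenter, loader):
    segmenter.eval()
    with torch.no_grad():
        for image, target in loader:
            output = segmenter(image.cuda(non_blocking=True))
            # ... compute metrics on output vs. target ...

# 3. Let the caching allocator hand unused blocks back between training stages.
torch.cuda.empty_cache()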