Closed Mayur28 closed 3 years ago
If you change gpu_ids from 0,1,2 to 0, you will only use a single GPU and the available memory will be 1/3 of the original, which is not enough to train the model.
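To illustrate the point, here is a rough sketch (not code from this repo) of why dropping from three GPUs to one roughly triples per-GPU memory: `nn.DataParallel` splits each batch across the visible GPUs, so the activations each GPU holds scale with its slice of the batch. The batch size below is a made-up number for illustration only.

```python
# Rough illustration: DataParallel splits each batch across the
# visible GPUs, so per-GPU activation memory scales with the slice.
# The batch size (30) is hypothetical, not the repo's actual config.

def per_gpu_batch(total_batch, num_gpus):
    """Images each GPU processes per step under DataParallel."""
    return total_batch / num_gpus

# gpu_ids = 0,1,2 -> each GPU holds a third of the activations
print(per_gpu_batch(30, 3))  # 10.0 images per GPU
# gpu_ids = 0 -> one GPU holds everything (3x the activation memory)
print(per_gpu_batch(30, 1))  # 30.0 images per GPU
```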
Thanks for the reply.
As mentioned above, I made that amendment in the hope that the Google Colab GPU is powerful enough (and has sufficient memory) to train the model. Unfortunately, I do not have access to 3 GPUs at the moment, but I am at least glad that the source of the issue has been established.
@Mayur28 Out of memory here is a GPU memory issue; changing the pool size will only help with RAM issues. If the GPU you are using has more than 35 GB of memory, I think this problem will be solved.
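As a back-of-envelope check of why so much GPU memory is needed, here is a sketch of the activation memory for VGG's `conv1_1` (64 output channels), the layer where the traceback fails. The image resolution and batch size below are assumptions for illustration; they are not taken from this repo's actual config.

```python
# Back-of-envelope estimate of activation memory for one conv layer.
# Resolution (600x400) and batch size (32) are hypothetical numbers.

def activation_bytes(batch, channels, height, width, bytes_per_elem=4):
    """Bytes held by one float32 tensor of shape (batch, C, H, W)."""
    return batch * channels * height * width * bytes_per_elem

# One conv1_1 output (64 channels) for a batch of 32 images at 600x400:
size_gib = activation_bytes(32, 64, 600, 400) / 2**30
print(f"{size_gib:.2f} GiB")  # ~1.83 GiB for a single activation tensor
```

A full VGG forward pass keeps many such tensors alive at once (plus the generator, discriminators, and their gradients), which is how training can exhaust a small GPU even at modest batch sizes.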
Thank you for the clarification. Will give it a try!
Hi TAMU-VITA,
Your work is really impressive!
I've been trying to train your model as specified in the README file, but I keep getting a CUDA out-of-memory error. I initially received this error on my own GPU (a GTX 1650), so I then tried to train it using Google Colab, hoping that their GPU is more powerful, but the problem persists. I've read through the previous issues similar to mine, hoping I could find a solution. Following from this, the only changes I made to try and train the model were to change gpu_ids from '0,1,2' to just '0' (since I'm using a single GPU) and to reduce the pool size to 25. Despite these minor alterations, the error remains.
The stacktrace that I get is as follows:

```
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    model.optimize_parameters(epoch)
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/single_model.py", line 397, in optimize_parameters
    self.backward_G(epoch)
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/single_model.py", line 334, in backward_G
    self.fake_B, self.real_A) * self.opt.vgg if self.opt.vgg > 0 else 0
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/networks.py", line 1028, in compute_vgg_loss
    img_fea = vgg(img_vgg, self.opt)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 71, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/networks.py", line 955, in forward
    h = F.relu(self.conv1_1(X), inplace=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
```
Any help in this regard would be highly appreciated.