VITA-Group / EnlightenGAN

[IEEE TIP] "EnlightenGAN: Deep Light Enhancement without Paired Supervision" by Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, Zhangyang Wang

Running out of memory when training #62

Closed Mayur28 closed 3 years ago

Mayur28 commented 3 years ago

Hi TAMU-VITA,

Your work is really impressive!

I've been trying to train your model as specified in the README, but I keep getting a CUDA out-of-memory error. I initially received the same error on my own GPU (a GTX 1650), so I then tried training on Google Colab in the hope that its GPU would be more powerful, but the problem persists. I've read through the previous issues similar to mine hoping to find a solution. Following from those, the only changes I made were setting gpu_ids from '0,1,2' to just '0' (since I'm using a single GPU) and reducing the pool size to 25. Despite these minor alterations, the problem persists.

The stacktrace that I get is as follows:

```
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    model.optimize_parameters(epoch)
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/single_model.py", line 397, in optimize_parameters
    self.backward_G(epoch)
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/single_model.py", line 334, in backward_G
    self.fake_B, self.real_A) * self.opt.vgg if self.opt.vgg > 0 else 0
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/networks.py", line 1028, in compute_vgg_loss
    img_fea = vgg(img_vgg, self.opt)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 71, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/Hello-World/EnlightenGAN-master/EnlightenGAN-master/models/networks.py", line 955, in forward
    h = F.relu(self.conv1_1(X), inplace=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
```

Any help in this regard will be highly appreciated.

yifanjiang19 commented 3 years ago

If you change gpu_ids from '0,1,2' to '0', you will use only a single GPU, and the available memory will be 1/3 of the original, which is not enough to train the model.
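For anyone else hitting this: the model is wrapped in `nn.DataParallel`, which splits each batch evenly across the listed devices, so dropping from three GPUs to one means the single card holds the whole batch. A minimal sketch of the arithmetic (the batch size here is an illustrative assumption, not a measured EnlightenGAN figure):

```python
# Per-GPU batch share under nn.DataParallel, which scatters each batch
# evenly across the listed devices. Numbers below are illustrative
# assumptions, not measurements from EnlightenGAN.

def per_gpu_batch(batch_size, gpu_ids):
    """DataParallel sends roughly ceil(batch_size / n_gpus) samples to each GPU."""
    n = len(gpu_ids)
    return -(-batch_size // n)  # ceiling division

# With gpu_ids='0,1,2' each GPU holds about a third of the batch...
print(per_gpu_batch(32, [0, 1, 2]))  # -> 11
# ...but with gpu_ids='0' the one GPU holds the full batch, so its
# activation memory roughly triples.
print(per_gpu_batch(32, [0]))  # -> 32
```

This is why shrinking gpu_ids without also shrinking the batch size (or the crop size) pushes a single card over its memory limit.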

Mayur28 commented 3 years ago

Thanks for the reply.

As mentioned above, I made that change in the hope that the Google Colab GPU would be powerful enough (and have sufficient memory) to train the model. Unfortunately, I don't have access to 3 GPUs at the moment, but I am at least glad that the source of the issue has been established.

yifanjiang19 commented 3 years ago

@Mayur28 Out of memory here is a GPU memory issue; changing the pool size only helps with RAM usage. If the GPU you are using has more than 35GB of memory, I think this problem will be solved.
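A quick way to check whether a machine clears that bar before launching a run is to sum the memory reported by `nvidia-smi`. A small sketch (the 35 GB threshold is just the figure quoted in this thread, not an official requirement of the repo):

```python
# Sanity-check total visible GPU memory against the ~35 GB figure
# mentioned above. The threshold comes from this thread, not from any
# official EnlightenGAN documentation.
import subprocess

def total_gpu_memory_gb():
    """Sum the memory of all visible GPUs, in GB, by querying nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        text=True)
    mib = sum(int(line) for line in out.splitlines() if line.strip())
    return mib / 1024

def enough_memory(total_gb, required_gb=35):
    """True if the combined GPU memory meets the quoted requirement."""
    return total_gb >= required_gb

# A lone GTX 1650 (4 GB) falls far short; three 16 GB cards clear it.
print(enough_memory(4))       # -> False
print(enough_memory(3 * 16))  # -> True
```

On a machine with GPUs, `enough_memory(total_gpu_memory_gb())` gives a yes/no answer before any training time is wasted.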

Mayur28 commented 3 years ago

Thank you for the clarification. Will give it a try!