clovaai / stargan-v2

StarGAN v2 - Official PyTorch Implementation (CVPR 2020)

RuntimeError: CUDA out of memory #29

Open simone-porcu opened 4 years ago

simone-porcu commented 4 years ago

I'm running the training with default --batch_size 8 and I get:

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 15.75 GiB total capacity; 14.58 GiB already allocated; 22.88 MiB free; 14.75 GiB reserved in total by PyTorch)

Server details:

I'm running this training on Google Cloud Platform.

yunjey commented 4 years ago

@SirQuickWay Please write the full error message.

surpel commented 4 years ago

Hi, I got the same problem when I tried to train on a Tesla P100 with the default settings. I monitored GPU memory usage with the gpustat toolkit; the memory climbed steadily and hit OOM once it went past 16000 MB (a minimal way to watch this is sketched at the end of this comment). The full output is:

Number of parameters of generator: 43467395
Number of parameters of mapping_network: 2438272
Number of parameters of style_encoder: 20916928
Number of parameters of discriminator: 20852290
Number of parameters of fan: 6333603
Initializing generator...
Initializing mapping_network...
Initializing style_encoder...
Initializing discriminator...
Preparing DataLoader to fetch source images during the training phase...
Preparing DataLoader to fetch reference images during the training phase...
Preparing DataLoader for the generation phase...
Start training...
/home/*****/anaconda3/envs/stargan-v2/lib/python3.6/site-packages/torch/nn/functional.py:2506: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
Traceback (most recent call last):
  File "main.py", line 182, in <module>
    main(args)
  File "main.py", line 59, in main
    solver.train(loaders)
  File "/data/home/*****/project/stargan-v2/core/solver.py", line 131, in train
    nets, args, x_real, y_org, y_trg, x_refs=[x_ref, x_ref2], masks=masks)
  File "/data/home/*****/project/stargan-v2/core/solver.py", line 273, in compute_g_loss
    x_rec = nets.generator(x_fake, s_org, masks=masks)
  File "/home/*****/anaconda3/envs/stargan-v2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/*****/project/stargan-v2/core/model.py", line 181, in forward
    x = block(x, s)
  File "/home/*****/anaconda3/envs/stargan-v2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/*****/project/stargan-v2/core/model.py", line 117, in forward
    out = self._residual(x, s)
  File "/data/home/*****/project/stargan-v2/core/model.py", line 110, in _residual
    x = self.conv1(x)
  File "/home/*****/anaconda3/envs/stargan-v2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/*****/anaconda3/envs/stargan-v2/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/*****/anaconda3/envs/stargan-v2/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 15.90 GiB total capacity; 14.89 GiB already allocated; 127.38 MiB free; 15.12 GiB reserved in total by PyTorch)

And another question: how much memory do I need for training at a higher resolution like 512x512? Many thanks!

I tried a smaller batch size as @eps696 suggested and it works, but I'm not sure how much it will affect performance. I'll follow up on this.
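
For reference, a minimal way to watch GPU memory climb in real time (assuming gpustat is installed via pip; plain nvidia-smi works as a fallback):

    pip install gpustat
    gpustat -i 1            # watch mode, refresh every second
    # or, with no extra packages:
    watch -n 1 nvidia-smi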

eps696 commented 4 years ago

try a smaller batch size. as an example, i've trained a 512x512 model on an 8gb gpu with batch_size=1
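
For concreteness, a sketch of what that looks like with the README's CelebA-HQ training command (the --img_size, --batch_size and --val_batch_size flags are assumed to be exposed by main.py's argument parser):

    python main.py --mode train --num_domains 2 --w_hpf 1 \
                   --lambda_reg 1 --lambda_sty 1 --lambda_ds 1 --lambda_cyc 1 \
                   --train_img_dir data/celeba_hq/train \
                   --val_img_dir data/celeba_hq/val \
                   --img_size 512 --batch_size 1 --val_batch_size 4

Lowering --val_batch_size as well should keep the periodic sampling step from needing more memory than the training steps do (see later in this thread).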

wenhe-jia commented 4 years ago

try a smaller batch size. as an example, i've trained a 512x512 model on an 8gb gpu with batch_size=1

I also have a GPU memory limitation. On a GPU with 12GB of memory, I halved the batch_size and lr and doubled the number of iterations. Can I get the same model performance as with batch_size=8?
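
For concreteness, the run being described would look roughly like this (the flag names, and the defaults lr=1e-4, f_lr=1e-6, total_iters=100000, are assumptions about main.py; whether halving lr and doubling iterations actually recovers batch_size=8 quality is exactly the open question):

    python main.py --mode train --num_domains 2 --w_hpf 1 \
                   --train_img_dir data/celeba_hq/train \
                   --val_img_dir data/celeba_hq/val \
                   --batch_size 4 --lr 5e-5 --f_lr 5e-7 --total_iters 200000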

eps696 commented 4 years ago

@LeonJWH alas, performance-wise batch size seems to be quite important. i've ended up at size 256 and batch 4 (the most i could fit on an 11gb geforce card) - the results look a bit worse than the original model trained with batch 8, but usable in practice. i haven't played with lr yet though.

lmarson94 commented 4 years ago

Maybe they used a Tesla V100 32GB for training? It would be helpful if the authors clarified this.

doantientai commented 4 years ago

try a smaller batch size. as an example, i've trained a 512x512 model on an 8gb gpu with batch_size=1

Hi @eps696, I am training on a 512x512 dataset with batch_size=1 too. It takes less than 7GB during training. However, the program crashes when it reaches the sampling step (sample_every = 5000 in my case). Don't you have the same problem?

...
Elapsed time [1:47:31], Iteration [5000/100000], D/latent_real: [1.8028] D/latent_fake: [0.1497] D/latent_reg: [0.0062] D/ref_real: [0.0631] D/ref_fake: [1.0272] D/ref_reg: [0.0052] G/latent_adv: [2.9588] G/latent_sty: [0.4823] G/latent_ds: [0.1133] G/latent_cyc: [0.2219] G/ref_adv: [1.6497] G/ref_sty: [0.1405] G/ref_ds: [0.0690] G/ref_cyc: [0.2233] G/lambda_ds: [0.9500]
Traceback (most recent call last):

  File "main.py", line 257, in <module>
    main(args)
  File "main.py", line 59, in main
    solver.train(loaders)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/core/solver.py", line 162, in train
    utils.debug_image(nets_ema, args, inputs=inputs_val, step=i+1)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/core/utils.py", line 139, in debug_image
    translate_using_latent(nets, args, x_src, y_trg_list, z_trg_list, psi, filename)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/core/utils.py", line 94, in translate_using_latent
    x_fake = nets.generator(x_src, s_trg, masks=masks)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/core/model.py", line 181, in forward
    x = block(x, s)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/core/model.py", line 119, in forward
    out = (out + self._shortcut(x)) / math.sqrt(2)
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/core/model.py", line 100, in _shortcut
    x = F.interpolate(x, scale_factor=2, mode='nearest')
  File "/opt/deeplearning/tai/StarGanV2/Source/stargan-v2/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 2512, in interpolate
    return torch._C._nn.upsample_nearest2d(input, _output_size(2))
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 10.92 GiB total capacity; 3.80 GiB already allocated; 1.39 GiB free; 8.95 GiB reserved in total by PyTorch)

eps696 commented 4 years ago

@doantientai i've added del x_fake2, y_org, y_trg, s_trg2, s_trg, s_pred, out before this line, that's probably all
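
In other words, something along these lines right before the line that link points to (a sketch only; which variables are actually alive in scope at that point in core/solver.py may differ, and the empty_cache() call is an extra precaution rather than part of the original suggestion):

    # drop training intermediates that are no longer needed before sampling
    del x_fake2, y_org, y_trg, s_trg2, s_trg, s_pred, out
    torch.cuda.empty_cache()  # optional: release cached blocks, can help with fragmentation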

doantientai commented 4 years ago

@doantientai i've added del x_fake2, y_org, y_trg, s_trg2, s_trg, s_pred, out before this line, that's probably all

Thank you. I tried that but it didn't solve the problem. What a shame, given that I only have 11GB of GPU memory 😭

doantientai commented 4 years ago

@doantientai i've added del x_fake2, y_org, y_trg, s_trg2, s_trg, s_pred, out before this line, that's probably all

Thank you. I tried that but it didn't solve the problem. What a shame, given that I only have 11GB of GPU memory

So I did some digging and found that this line: x_concat = [x_src] accumulates a lot of images, all of which stay on the GPU. So I moved them to the CPU and replaced the torch operations with numpy. Now I have no more OOM at the sampling step. Hope this helps somebody.
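
Roughly, the change looks like this (a sketch against translate_using_latent in core/utils.py; the names follow the repo, but the style-averaging/psi-interpolation details of the real loop are omitted here):

    # accumulate the sample grid in host memory instead of on the GPU
    x_concat = [x_src.cpu()]
    for y_trg in y_trg_list:
        for z_trg in z_trg_list:
            s_trg = nets.mapping_network(z_trg, y_trg)
            x_fake = nets.generator(x_src, s_trg, masks=masks)
            x_concat.append(x_fake.cpu())  # copy each batch off the GPU right away
            del x_fake                     # free the GPU copy before the next iteration
    x_concat = torch.cat(x_concat, dim=0)  # concatenation now happens on the CPU
    save_image(x_concat, N, filename)      # assumes the repo's save helper accepts CPU tensors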

fluctlux commented 4 years ago

@doantientai i've added del x_fake2, y_org, y_trg, s_trg2, s_trg, s_pred, out before this line, that's probably all

Thank you. I tried that but it didn't solve the problem. What a shame, given that I only have 11GB of GPU memory

So I did some digging and found that this line: x_concat = [x_src] accumulates a lot of images, all of which stay on the GPU. So I moved them to the CPU and replaced the torch operations with numpy. Now I have no more OOM at the sampling step. Hope this helps somebody.

I think setting the batch size to 4 and reducing val_batch_size from the default 32 to 8 will help.
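
i.e. something like this, using the same training command as earlier in the thread (flag names assumed from main.py):

    python main.py --mode train --num_domains 2 --w_hpf 1 \
                   --train_img_dir data/celeba_hq/train \
                   --val_img_dir data/celeba_hq/val \
                   --batch_size 4 --val_batch_size 8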

sumeyyegsu commented 3 years ago

I resized the images (from 1024x1024 to 128x128), set the batch size to 2 and val_batch_size to 8. I want at least to see that I can train it, but it is still difficult to train on my computer. Do you have any additional ideas?
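
For anyone reproducing that resizing step, here is a minimal Pillow sketch (the folder paths are hypothetical; if the repo's data loader already rescales to --img_size, as its transforms appear to, pre-resizing on disk mainly saves decoding time):

    from pathlib import Path
    from PIL import Image

    src_dir, dst_dir = Path("data/train_1024"), Path("data/train_128")
    dst_dir.mkdir(parents=True, exist_ok=True)
    for path in src_dir.glob("*.jpg"):
        img = Image.open(path).convert("RGB")
        img = img.resize((128, 128), Image.LANCZOS)  # 1024x1024 -> 128x128
        img.save(dst_dir / path.name, quality=95)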

ugurcansakizli commented 3 years ago

you can try using cloud computing for training. https://www.youtube.com/channel/UCaZuPdmZ380SFUMKHVsv_AA - this channel has detailed instructions on how to do it online using Google's free GPUs. Good luck ;D

sumeyyegsu commented 3 years ago

you can try using cloud computing for training. https://www.youtube.com/channel/UCaZuPdmZ380SFUMKHVsv_AA - this channel has detailed instructions on how to do it online using Google's free GPUs. Good luck ;D

Thank you, but I already use Google Colab for this.

newday233 commented 3 years ago

@doantientai i've added del x_fake2, y_org, y_trg, s_trg2, s_trg, s_pred, out before this line, that's probably all

Thank you. I tried that but it didn't solve the problem. What a shame, given that I only have 11GB of GPU memory

So I did some digging and found that this line: x_concat = [x_src] accumulates a lot of images, all of which stay on the GPU. So I moved them to the CPU and replaced the torch operations with numpy. Now I have no more OOM at the sampling step. Hope this helps somebody.

hello, could you please share the code?