clovaai / tunit

Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

I'm not sure why summer2winter runs out of CUDA memory. #11

Closed kimtaehyeong closed 4 years ago

kimtaehyeong commented 4 years ago

Thank you for the awesome work.

I am currently training on custom data in a fully unsupervised way, but I ran into a problem: a CUDA out-of-memory error occurs at the transition from epoch 69 to 70.

'RuntimeError: CUDA out of memory. Tried to allocate 20.61 GiB (GPU 0; 23.65 GiB total capacity; 3.54 GiB already allocated; 17.74 GiB free; 1.55 GiB cached)'

The command I executed is: `python main.py --gpu 0 --dataset summer2winter --output_k 2 --data_path ../data --p_semi 0.0 --img_size 256 --batch_size 1 --ddp`

Also, my GPU is a single TITAN RTX.

Thanks for letting me know about this issue.

FriedRonaldo commented 4 years ago

Hi, thanks for your interest.

The code should run with at least batch_size=4 or 8 on a GPU with 24 GB of memory.

I tried running the code with batch_size=2 and img_size=256 on an RTX 2080 Ti (12 GB memory), also with --ddp, and it works well. Specifically, I ran `python main.py --gpu 0 --batch_size 2 --img_size 256 --dataset summer2winter --ddp`, and the memory usage is about 6.5 GB.

I cannot guess the cause of this odd OOM. If it were raised at epoch 65 or 66, it might come from the start of GAN training or the FID calculation, but that is not the case here.

Can you provide more details (e.g. any modifications to the code)?


EDIT I successfully reproduced the error! I did not notice that I was using the old code when I wrote the beginning of this reply.

The problem is line 117 of validation.py (https://github.com/clovaai/tunit/blob/master/validation/validation.py#L117). This line uses a large amount of memory.

There are two possible options:

  1. Commenting out lines 98 ~ 124 resolves the issue immediately.
  2. Reduce the number of samples in cluster_grid after https://github.com/clovaai/tunit/blob/master/validation/validation.py#L108
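The second option could be sketched roughly like this (a minimal illustration, not the actual repository code; the variable name `cluster_grid` and its exact shape around validation.py#L108 are assumptions):

```python
import torch

# Stand-in for the stacked per-cluster validation samples:
# (num_samples, channels, height, width). Real images are 256x256; a
# small spatial size is used here just to keep the sketch lightweight.
cluster_grid = torch.randn(50, 3, 8, 8)

# Keep only a few samples before the expensive grid/visualization step,
# so the allocation at validation.py#L117 shrinks accordingly.
max_samples = 8
cluster_grid = cluster_grid[:max_samples]

print(tuple(cluster_grid.shape))  # (8, 3, 8, 8)
```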

I will fix the error. Thanks for letting me know about the issue!

kimtaehyeong commented 4 years ago

Thanks for the reply.

As you suggested, I am now training with lines 98 ~ 124 commented out.

Since the image size can be set to 128, 256, or 512, can the generated image be 512 or 256? The output always seems to be 128.

FriedRonaldo commented 4 years ago

I did not try 512 because of OOM, but the code works with 128 and 256, and it produces 256x256 images as well.

Can you point to the code line that makes the image size 128? I cannot find any code that makes the outputs 128. Did you check the size of the output image? It is 2840x2840 for me (256 for each image).
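For reference, 2840x2840 is consistent with torchvision's default grid padding of 2 pixels around each tile. Assuming the saved grid is 11x11 tiles of 256x256 (the 11x11 layout is an assumption here), the side length works out as:

```python
# make_grid-style side length: nrow tiles, each (tile + padding) pixels,
# plus one final padding strip on the trailing edge.
nrow, tile, padding = 11, 256, 2
side = nrow * (tile + padding) + padding
print(side)  # 2840
```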

kimtaehyeong commented 4 years ago

At present, I am training summer2winter with those lines commented out, and training is in progress.

The previous command was `python main.py --gpu 0 --dataset afhq_cat --output_k 10 --data_path ./data --p_semi 0.2 --img_size 256`. Training with that command has finished, and I wanted to generate an image using a reference image.

So I did it with the following command.

However, when I modified the code and ran it, the output image came out at size 256. I did the following; is it correct?


```python
import torch
import torch.nn.functional as F
import torchvision.utils as vutils
from PIL import Image
from torchvision.transforms import ToTensor

from models.generator import Generator
from models.guidingNet import GuidingNet

G = Generator(256, 128)
C = GuidingNet(256)

load_file = './logs/GAN_20200712-010938/model_190.ckpt'
checkpoint = torch.load(load_file, map_location='cpu')
G.load_state_dict(checkpoint['G_EMA_state_dict'])
C.load_state_dict(checkpoint['C_EMA_state_dict'])
G.eval()
C.eval()

source_image = Image.open('365_A.png')
reference_image = Image.open('133_B.png')

x_src = ToTensor()(source_image).unsqueeze(0)
x_ref = ToTensor()(reference_image).unsqueeze(0)

x_src = F.interpolate(x_src, size=(256, 256))
x_ref = F.interpolate(x_ref, size=(256, 256))

x_src = (x_src - 0.5) / 0.5
x_ref = (x_ref - 0.5) / 0.5

s_ref = C.moco(x_ref)
x_res = G(x_src, s_ref)

vutils.save_image(x_res, 'test_out_190.jpg', normalize=True, padding=0)
```
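As an aside, the `(x - 0.5) / 0.5` step in the snippet maps `ToTensor()`'s `[0, 1]` range onto the `[-1, 1]` range the generator expects; a quick sanity check of that remap:

```python
# Affine remap from [0, 1] to [-1, 1], as applied to x_src and x_ref above.
for v in (0.0, 0.5, 1.0):
    print((v - 0.5) / 0.5)  # -1.0, 0.0, 1.0
```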


The result was a 256 image. The part I noticed is `G = Generator(256, 128)`. If I set it to (256, 256), I get an error.

FriedRonaldo commented 4 years ago

I cannot quite understand your question... If you are asking whether a model trained with 256x256 can produce 128 or 512 images, the answer is no.

It is correct that the model provides 256x256 images when you train the model with 256x256 images.


> The result was 256 images, The part I noticed is 'G = Generator(256, 128)'. If I set it to 256,256, I get an error.

The argument list of Generator is:

```python
class Generator(nn.Module):
    def __init__(self, img_size=128, sty_dim=64, n_res=2, use_sn=False):
```

The first argument is the image size and the second is the style dimension. So if you change the second argument, it is expected to raise an error; it is not a bug. Please check the argument list of Generator.

Generator(256, 128) does not mean the input image is 256x128; it means 256x256 images with a 128-dimensional style code.
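To make the two positional arguments unambiguous, here is a minimal stand-in mirroring the signature above (an illustrative dummy, not the real tunit model):

```python
class Generator:  # dummy with the same argument order as tunit's Generator
    def __init__(self, img_size=128, sty_dim=64, n_res=2, use_sn=False):
        self.img_size = img_size  # square output resolution (256 -> 256x256)
        self.sty_dim = sty_dim    # style-code dimensionality

G = Generator(256, 128)
print(G.img_size, G.sty_dim)  # 256 128
```

Passing keyword arguments (`Generator(img_size=256, sty_dim=128)`) avoids this confusion entirely.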


EDIT If you trained the model with 256x256 images (and initialized the generator with Generator(256, X)), the output will be 256x256. -> Not always: the model can produce 512x512 outputs even though it was trained with 256x256, but in that case the quality might be degraded.

kimtaehyeong commented 4 years ago

With your help, I solved this problem. Thank you so much!