RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 39.59 GiB total capacity; 33.31 GiB already allocated; 1.06 GiB free; 36.81 GiB reserved in total by PyTorch)

asagar60 commented 2 years ago

I am trying to generate images using a pretrained StyleGAN2-SPD-ADA model, but I keep getting this error. I initially thought it was due to the 15 GB GPU on Colab, but I also tried 24 GB and 40 GB GPUs and still get the same error.

I tried reducing the batch size from 64 to 32 to 16, but the error is still the same.

code : !python PyTorch-StudioGAN/src/main.py -t -v -ckpt StyleGAN2-SPD-ADA-train-2021_10_18_16_01_19 -cfg PyTorch-StudioGAN/src/configs/AFHQ/StyleGAN2-SPD-ADA.yaml -save gen -data afhq -best

Logs:

[INFO] 2022-04-30 07:22:26 > Generator checkpoint is StyleGAN2-SPD-ADA-train-2021_10_18_16_01_19/model=G-best-weights-step=196000.pth
[INFO] 2022-04-30 07:22:26 > EMA_Generator checkpoint is StyleGAN2-SPD-ADA-train-2021_10_18_16_01_19/model=G_ema-best-weights-step=196000.pth
[INFO] 2022-04-30 07:22:26 > Discriminator checkpoint is StyleGAN2-SPD-ADA-train-2021_10_18_16_01_19/model=D-best-weights-step=196000.pth
/opt/conda/lib/python3.8/site-packages/torchvision/models/inception.py:44: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  warnings.warn(
wandb: Currently logged in as: asagar60 (use wandb login --relogin to force relogin)
wandb: Tracking run with wandb version 0.12.15
wandb: Run data is saved locally in gen/wandb/run-20220430_072228-1zr43u7c
wandb: Run wandb offline to turn off syncing.
wandb: Resuming run StyleGAN2-SPD-ADA-train-2021_10_18_16_01_19
wandb: ⭐️ View project at https://wandb.ai/asagar60/uncategorized
wandb: 🚀 View run at https://wandb.ai/asagar60/uncategorized/runs/1zr43u7c
[INFO] 2022-04-30 07:22:29 > Start training!
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
Traceback (most recent call last):
  File "PyTorch-StudioGAN/src/main.py", line 182, in <module>
    loader.load_worker(local_rank=rank,
  File "/home/PyTorch-StudioGAN/src/loader.py", line 348, in load_worker
    gen_acml_loss = worker.train_generator(current_step=step)
  File "/home/PyTorch-StudioGAN/src/worker.py", line 564, in train_generator
    fake_dict = self.Dis(fakeimages, fake_labels)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/PyTorch-StudioGAN/src/models/stylegan2.py", line 849, in forward
    x, img = block(x, img, **block_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/PyTorch-StudioGAN/src/models/stylegan2.py", line 648, in forward
    x = self.conv0(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/PyTorch-StudioGAN/src/models/stylegan2.py", line 176, in forward
    x = conv2d_resample.conv2d_resample(x=x,
  File "/home/PyTorch-StudioGAN/src/utils/style_ops/conv2d_resample.py", line 133, in conv2d_resample
    return _conv2d_wrapper(x=x, w=w, padding=[py0,px0], groups=groups, flip_weight=flip_weight)
  File "/home/PyTorch-StudioGAN/src/utils/style_ops/conv2d_resample.py", line 41, in _conv2d_wrapper
    return op(x, w, stride=stride, padding=padding, groups=groups)
  File "/home/PyTorch-StudioGAN/src/utils/style_ops/conv2d_gradfix.py", line 37, in conv2d
    return _conv2d_gradfix(transpose=False, weight_shape=weight.shape, stride=stride, padding=padding, output_padding=0, dilation=dilation, groups=groups).apply(input, weight, bias)
  File "/home/PyTorch-StudioGAN/src/utils/style_ops/conv2d_gradfix.py", line 127, in forward
    return torch.nn.functional.conv2d(input=input, weight=weight, bias=bias, **common_kwargs)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 39.59 GiB total capacity; 33.31 GiB already allocated; 1.06 GiB free; 36.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
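
To read the numbers in the error message, it can help to pull them out and compare reserved against allocated memory, since the hint about max_split_size_mb only applies when reserved is much larger than allocated. The following is a minimal sketch, not part of StudioGAN or PyTorch; the helper name parse_oom_message is hypothetical, and it assumes every figure is reported in GiB as in the message above.

```python
import re


def parse_oom_message(message: str) -> dict:
    """Extract the GiB figures from a PyTorch CUDA OOM message.

    Hypothetical helper: assumes all sizes are reported in GiB.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        stats[name] = float(match.group(1))
    # Memory held by PyTorch's caching allocator but not backing live tensors.
    stats["cached_unused"] = round(stats["reserved"] - stats["allocated"], 2)
    return stats


message = (
    "CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 39.59 GiB total "
    "capacity; 33.31 GiB already allocated; 1.06 GiB free; 36.81 GiB reserved "
    "in total by PyTorch)"
)
stats = parse_oom_message(message)
print(stats["cached_unused"], stats["requested"])
```

For this message the gap between reserved and allocated is about 3.5 GiB, which is small next to the 33.31 GiB already allocated. One reading, which the thread does not confirm, is that the GPU is mostly filled by live tensors rather than by fragmentation. The traceback also passes through train_generator, so the run is executing a training step rather than only generating images.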

alex4727 commented 2 years ago

I think there's a bug regarding the -v option. For now, instead of saving the images as a canvas (which is what the -v option does), you can save them one by one in PNG format. To do so, add the -sf -sf_num NUMBER_OF_IMAGES_TO_GENERATE options. If you only plan to generate images, you can omit the -t option and specify -metrics none to avoid unnecessary training and evaluation steps. We'll try to fix the bug ASAP. +) Since StyleGAN models are trained using mixed precision, I also recommend using -mpc in all cases.
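
Applied to the command from the original post, the suggestion above could look like the following sketch. It only rearranges flags that appear in this thread; the helper generation_only_args is hypothetical, the image count of 1000 is an arbitrary placeholder, and keeping -save, -data and -best unchanged is an assumption.

```python
# Arguments from the original post (the leading "!" is Colab shell syntax).
original = [
    "python", "PyTorch-StudioGAN/src/main.py",
    "-t", "-v",
    "-ckpt", "StyleGAN2-SPD-ADA-train-2021_10_18_16_01_19",
    "-cfg", "PyTorch-StudioGAN/src/configs/AFHQ/StyleGAN2-SPD-ADA.yaml",
    "-save", "gen",
    "-data", "afhq",
    "-best",
]


def generation_only_args(args, num_images):
    """Drop -t and -v, then add the flags suggested in the reply."""
    kept = [arg for arg in args if arg not in ("-t", "-v")]
    return kept + ["-sf", "-sf_num", str(num_images), "-metrics", "none", "-mpc"]


# 1000 is a placeholder for NUMBER_OF_IMAGES_TO_GENERATE.
command = generation_only_args(original, num_images=1000)
print(" ".join(command))
```

The printed line can be pasted into a Colab cell with a leading "!".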

lavish619 commented 2 years ago

@alex4727 Hi, you mentioned in your comment that StyleGAN models are trained using mixed precision. But in the code, wherever mixed precision is used, there is an additional condition of not is_stylegan. I was trying to figure out why mixed-precision training is disabled for StyleGAN, and your comment that StyleGAN uses mpc confuses me.

It would be very helpful if you could clarify that. Thanks in advance..!!

alex4727 commented 2 years ago

@lavish619
Sorry for the late reply. You are correct: wherever mixed precision is used, an additional condition of not is_stylegan is present. That is because StyleGAN incorporates fp16 datatypes in the model file itself, so there's no need to use the torch.cuda.amp.autocast() wrapper in the worker. Thanks!
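
The condition described above can be sketched as a small helper that decides whether a forward pass gets an autocast wrapper. This is an illustration of the pattern only, not StudioGAN's actual worker code: the names precision_context and RecordingContext are hypothetical, and RecordingContext is a stand-in so the sketch runs without a GPU. In real code the factory would be torch.cuda.amp.autocast.

```python
from contextlib import nullcontext


def precision_context(mixed_precision, is_stylegan, autocast_factory):
    """Return the context manager to wrap a forward pass in.

    StyleGAN handles fp16 inside the model itself, so it gets no wrapper.
    """
    if mixed_precision and not is_stylegan:
        return autocast_factory()
    return nullcontext()


class RecordingContext:
    """Stand-in for torch.cuda.amp.autocast that counts how often it is entered."""

    entered = 0

    def __enter__(self):
        RecordingContext.entered += 1
        return self

    def __exit__(self, *exc_info):
        return False


# With mixed precision on, only the non-StyleGAN model is wrapped.
for is_stylegan in (False, True):
    with precision_context(True, is_stylegan, RecordingContext):
        pass  # the forward pass would go here

print(RecordingContext.entered)
```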

POSTECH-CVLab / PyTorch-StudioGAN, issue #144