sahver opened this issue 3 years ago
You'll most likely have to reduce the batch size. If you watch the GPUs (via e.g. `watch -n 0.5 -c gpustat --color`), you'll see that memory usage goes up and down a lot at the beginning of training and then stabilizes. It's during one of these spikes that it might OOM, so it's better to lower the batch size to prevent this.
A recommended batch size for 1024×1024 images (and 2 GPUs with 24 GB of memory each) is 16, though you can use 24 if you want to use more of the VRAM (riskier, of course). Also, I don't generate the default grid of fake images, as it has a resolution of 8k and is a waste of memory imo, so I just go for 4k or even 1080p (code to do this is here).
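For reference, lowering the batch size is just a command-line override (assuming the stylegan2-ada-pytorch train.py, where --batch sets the total batch size summed across GPUs), e.g.:

Command: python train.py --data ... --gpus 2 --batch 16

And here is a minimal sketch of saving a smaller preview grid, in case the link above goes stale. `save_small_grid` is my own hypothetical helper, not the repo's `save_image_grid`; it assumes `G` is a loaded generator with `z_dim` / `c_dim` attributes as in stylegan2-ada-pytorch:

```python
import PIL.Image
import torch

@torch.no_grad()
def save_small_grid(G, fname, gw=4, gh=2, device='cuda'):
    # 4x2 tiles of 1024x1024 fakes -> a 4096x2048 preview instead of the 8k default.
    z = torch.randn([gw * gh, G.z_dim], device=device)
    c = torch.zeros([gw * gh, G.c_dim], device=device)  # unconditional
    img = G(z, c)  # [N, C, H, W], values roughly in [-1, 1]
    img = (img * 127.5 + 128).clamp(0, 255).to(torch.uint8).cpu().numpy()
    _, C, H, W = img.shape
    # Tile the N images into a (gh*H) x (gw*W) canvas; assumes RGB (C == 3).
    grid = img.reshape(gh, gw, C, H, W).transpose(0, 3, 1, 4, 2).reshape(gh * H, gw * W, C)
    PIL.Image.fromarray(grid, 'RGB').save(fname)
```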
What I found was that once training starts, GPU memory usage jumps through the roof, but during actual training only about 2/3 of it is used. How do we avoid the OOM at the beginning?
Steve
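A quick way to quantify that startup spike is PyTorch's built-in CUDA memory counters; the peak statistic captures the transient allocations that can OOM even when the steady state fits. A minimal sketch (the logging points are my own assumption, not something train.py already does):

```python
import torch

def log_gpu_memory(tag):
    # Current vs. peak usage on each visible GPU, in GiB.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        peak = torch.cuda.max_memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f'[{tag}] GPU{i}: {alloc:.2f} GiB allocated, '
              f'{peak:.2f} GiB peak, {reserved:.2f} GiB reserved')

# Call e.g. log_gpu_memory('tick') once per tick; use
# torch.cuda.reset_peak_memory_stats(i) to re-arm the peak counter.
```

If the peak is far above the steady-state number, lowering --batch (or rendering a smaller snapshot grid, as above) is the usual fix.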
48 GB of RAM and using --cfg paper256 worked for me; 32 GB of RAM ran out of memory. I'm using 2x 3060 Ti 8 GB GPUs.
I am having issues with training on Windows 10 with multiple GPUs.
If I run train.py with 2 GPUs, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 9.00 GiB (GPU 0; 24.00 GiB total capacity; 3.34 GiB already allocated; 9.89 GiB free; 11.68 GiB reserved in total by PyTorch)
If I run train.py with 1 GPU, there is no error and training proceeds just fine.
I have no issues with the same configuration running train.py with 1 or 2 GPUs under Ubuntu 20.04.
Any ideas how to solve the issue on Windows?
Command: python train.py --data ... --resume ... --snap 25 --gpus 2
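One thing worth checking on Windows: NCCL is not available there, so multi-GPU PyTorch runs fall back to the gloo backend, which initializes differently and can change how memory is allocated at startup. A quick sanity check of what your PyTorch build supports (a sketch; not part of the repo):

```python
import torch
import torch.distributed as dist

print('CUDA devices  :', torch.cuda.device_count())
print('NCCL available:', dist.is_nccl_available())  # False on Windows builds
print('Gloo available:', dist.is_gloo_available())
```

If NCCL is unavailable, lowering --batch as suggested above is probably the most reliable workaround.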
Desktop: