sahver opened this issue 3 years ago
You'll most likely have to reduce the batch size. If you watch the GPUs (via e.g. `watch -n 0.5 -c gpustat --color`), you'll see that memory usage goes up and down a lot at the beginning of training and then stabilizes. It's during one of these spikes that it might OOM, so it's better to lower the batch size to prevent this.
A recommended batch size for 1024×1024 images (and 2 GPUs with 24 GB of memory each) is 16, though you can use 24 if you want to use more of the VRAM (riskier, of course). Also, I don't generate the default grid of fake images, as it has a resolution of 8k and is a waste of memory imo, so I just go for 4k or even 1080p (code to do this is here).
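For reference, lowering the batch size is just a command-line override (assuming the stylegan2-ada-pytorch train.py, where --batch sets the total batch size summed across GPUs), e.g.:

Command: python train.py --data ... --gpus 2 --batch 16

And here is a minimal sketch of saving a smaller preview grid, in case the link above goes stale. `save_small_grid` is my own hypothetical helper, not the repo's `save_image_grid`; it assumes `G` is a loaded generator with `z_dim` / `c_dim` attributes as in stylegan2-ada-pytorch:

```python
import PIL.Image
import torch

@torch.no_grad()
def save_small_grid(G, fname, gw=4, gh=2, device='cuda'):
    # 4x2 tiles of 1024x1024 fakes -> a 4096x2048 preview instead of the 8k default.
    z = torch.randn([gw * gh, G.z_dim], device=device)
    c = torch.zeros([gw * gh, G.c_dim], device=device)  # unconditional
    img = G(z, c)  # [N, C, H, W], values roughly in [-1, 1]
    img = (img * 127.5 + 128).clamp(0, 255).to(torch.uint8).cpu().numpy()
    _, C, H, W = img.shape
    # Tile the N images into a (gh*H) x (gw*W) canvas; assumes RGB (C == 3).
    grid = img.reshape(gh, gw, C, H, W).transpose(0, 3, 1, 4, 2).reshape(gh * H, gw * W, C)
    PIL.Image.fromarray(grid, 'RGB').save(fname)
```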
What I found was that once training starts, GPU memory usage jumps through the roof, but during actual training only about 2/3 of it is used. How do we avoid the OOM at the beginning?
Steve
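A quick way to quantify that startup spike is PyTorch's built-in CUDA memory counters; the peak statistic captures the transient allocations that can OOM even when the steady state fits. A minimal sketch (the logging points are my own assumption, not something train.py already does):

```python
import torch

def log_gpu_memory(tag):
    # Current vs. peak usage on each visible GPU, in GiB.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        peak = torch.cuda.max_memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f'[{tag}] GPU{i}: {alloc:.2f} GiB allocated, '
              f'{peak:.2f} GiB peak, {reserved:.2f} GiB reserved')

# Call e.g. log_gpu_memory('tick') once per tick; use
# torch.cuda.reset_peak_memory_stats(i) to re-arm the peak counter.
```

If the peak is far above the steady-state number, lowering --batch (or rendering a smaller snapshot grid, as above) is the usual fix.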
48 GB of RAM and using --cfg paper256 worked for me; 32 GB of RAM ran out of memory. I'm using 2x 3060 Ti 8 GB GPUs.
I am having issues with training on Windows 10 with multiple GPUs.
If I run train.py with 2 GPUs, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 9.00 GiB (GPU 0; 24.00 GiB total capacity; 3.34 GiB already allocated; 9.89 GiB free; 11.68 GiB reserved in total by PyTorch)
If I run train.py with 1 GPU, there is no error and training proceeds just fine.
I have no issues with the same configuration running train.py with 1 or 2 GPUs under Ubuntu 20.04.
Any ideas how to solve the issue on Windows?
Command: python train.py --data ... --resume ... --snap 25 --gpus 2
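One thing worth checking on Windows: NCCL is not available there, so multi-GPU PyTorch runs fall back to the gloo backend, which initializes differently and can change how memory is allocated at startup. A quick sanity check of what your PyTorch build supports (a sketch; not part of the repo):

```python
import torch
import torch.distributed as dist

print('CUDA devices  :', torch.cuda.device_count())
print('NCCL available:', dist.is_nccl_available())  # False on Windows builds
print('Gloo available:', dist.is_gloo_available())
```

If NCCL is unavailable, lowering --batch as suggested above is probably the most reliable workaround.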
Desktop: