AlexBlck opened 2 years ago
Try running train.py with the `--nobench` option. The cuDNN benchmarking at init time is a memory hog.

Also, `--cfg=stylegan3-r` is the largest one. Unless you really need rotational equivariance (and can live with some symmetry artifacts and slower training without it being a problem), I would suggest you go with either `--cfg=stylegan3-t` or `--cfg=stylegan2`.
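For reference, a full invocation with those suggestions applied might look like the sketch below. Everything except `--cfg` and `--nobench` is a placeholder (adjust paths and flags to your own setup):

```shell
# Hypothetical example: use the lighter stylegan3-t config and disable
# the cuDNN benchmark pass at startup. --data/--gpus/--batch/--gamma
# are placeholders, not values from this thread.
python train.py --outdir=training-runs --data=/path/to/dataset.zip \
    --gpus=8 --batch=32 --gamma=6.6 \
    --cfg=stylegan3-t --nobench=True
```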
> Try running train.py with the `--nobench` option

This made it survive long enough to initialize everything, but it still OOMs right at `Training for 5000 kimg...`.
> Also, `--cfg=stylegan3-r` is the largest one

I know it's the largest one, but it should still fit into 8xV100, right? That's the hardware used in the paper, and the reported GPU memory is 10GB per GPU, while mine is twice that number.
I just tried `--cfg=stylegan2` to see how much that would take, and it's still running out of memory. I'm starting to think that my data is somehow the problem, which would be strange, since it's the same resolution.
Yeah, sorry, I didn't read thoroughly at first. There are high upticks in memory at the beginning, but the memory usage per GPU should drop once training starts; the GPU with the highest usage is always the first one. It is bizarre that it doesn't fit into 8xV100 though, even with `--cfg=stylegan2`. Have you watched the memory usage with e.g. gpustat? I like to use `gpustat -cup --watch`.
When training, tick 0 has the largest memory usage; after that it goes down to ~10GB per GPU. This is what I get, but with 2 GPUs (A40) and a 512x512 dataset, so I ended up using `--batch=16`:

```
tick 0  kimg 1240.0  time 37s     sec/tick 6.3    sec/kimg 391.28  maintenance 30.3  cpumem 4.79  gpumem 34.92  reserved 39.22  augment 0.000
tick 1  kimg 1244.0  time 6m 08s  sec/tick 327.6  sec/kimg 81.91   maintenance 4.0   cpumem 4.88  gpumem 11.29  reserved 35.05  augment 0.036
```
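If it helps to see the peak-then-drop pattern over a whole run, `gpumem` can be scraped out of log.txt. A minimal sketch (the regex assumes the tick-line format shown above):

```python
import re

# Matches e.g. "tick 1 ... gpumem 11.29 reserved 35.05 ..." from log.txt.
TICK_RE = re.compile(r"tick\s+(\d+).*?gpumem\s+([\d.]+)")

def gpumem_by_tick(lines):
    """Return [(tick, gpumem_gb), ...] for every tick line found."""
    out = []
    for line in lines:
        m = TICK_RE.search(line)
        if m:
            out.append((int(m.group(1)), float(m.group(2))))
    return out

# The two log lines from above; in practice: open("log.txt").
log = [
    "tick 0 kimg 1240.0 time 37s sec/tick 6.3 sec/kimg 391.28 maintenance 30.3 cpumem 4.79 gpumem 34.92 reserved 39.22 augment 0.000",
    "tick 1 kimg 1244.0 time 6m 08s sec/tick 327.6 sec/kimg 81.91 maintenance 4.0 cpumem 4.88 gpumem 11.29 reserved 35.05 augment 0.036",
]
print(gpumem_by_tick(log))  # [(0, 34.92), (1, 11.29)]
```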
`gpumem` goes from ~35 GB down to ~11 GB. I assume this is because of the custom ops and perhaps the cuDNN benchmark? This could be due to many things, so check both `log.txt` and `training_options.json` to confirm that everything is behaving nicely and that your command is actually being followed.
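A quick way to confirm the run picked up your flags is to diff the dumped options against what you expect. A sketch, assuming `training_options.json` uses the key names below (check your own file; they may differ across versions):

```python
import json

# Key names assumed from a typical StyleGAN3 training_options.json;
# verify against the file in your own run directory.
EXPECTED = {"batch_size": 16, "cudnn_benchmark": False}

def check_options(opts, expected=EXPECTED):
    """Return {key: (expected, actual)} for every mismatched option."""
    return {k: (v, opts.get(k)) for k, v in expected.items()
            if opts.get(k) != v}

# Example with an in-memory dict; in practice something like:
#   opts = json.load(open("training-runs/<run-dir>/training_options.json"))
opts = {"batch_size": 32, "cudnn_benchmark": False}
print(check_options(opts))  # {'batch_size': (16, 32)}
```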
I'm also training with this configuration on a custom dataset, and what I always see in my traceback is a call to accumulate_gradients from loss.py. Maybe the problem is there?

The first time it happened I was running with `--batch=32` and it wouldn't even start; it immediately threw a cuDNN error. I then changed the batch size to 16 and it was able to start training. For the first test I had `--snap=2` and it was running relatively fine, but too slow due to the metrics calculation. I changed to `--snap=15` and it threw `RuntimeError: CUDA out of memory` just after tick 5. I then changed to `--snap=5` and it was able to run until tick 10, evaluated the metrics, and when it was supposed to start tick 11 it threw the same error.

I'm now trying to run with `--nobench=True` to see if something changes. I'm running the model on 4xV100.
So did you find any solution for avoiding the OOM error?
bumping this
Hi,
I tried to run the config recommended for MetFaces-U at 1024x1024 resolution, but on my own dataset. On 8xV100 it was running out of memory, so I tried on 4xA6000. Turns out, it's taking ~40GB per GPU, which is quite a bit higher than the reported ~10GB.

The command I'm running:

```
python train.py --outdir=training-runs --cfg=stylegan3-r --data=/home/ubuntu/sg/data/v1 --gpus=4 --batch=32 --gamma=6.6 --mirror=1 --kimg=5000 --snap=5 --metrics=none --resume=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-ffhqu-1024x1024.pkl
```

And here is my nvidia-smi output:

![image](https://user-images.githubusercontent.com/39992844/164990385-710cc7dc-42f5-4317-8bd1-9017191f6539.png)
On 8xV100 I tried lowering `--batch-gpu` until it finally ran, but then 7 of the GPUs were using ~5GB while the first one was still running out of memory. Am I doing something wrong?
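One knob worth double-checking: `--batch` is the total batch split across GPUs, while `--batch-gpu` caps the per-GPU microbatch, with gradients accumulated over multiple rounds so the effective batch stays the same. A rough sketch of the arithmetic (my reading of how such a setup typically behaves, not an official formula from the repo):

```python
def accumulation_rounds(batch, gpus, batch_gpu=None):
    """Per-GPU microbatch and gradient-accumulation rounds for
    train.py-style settings. Assumes batch divides evenly across gpus
    (and across batch_gpu, if given)."""
    per_gpu = batch // gpus          # default: whole per-GPU share at once
    if batch_gpu is None:
        batch_gpu = per_gpu
    assert per_gpu % batch_gpu == 0
    return batch_gpu, per_gpu // batch_gpu

print(accumulation_rounds(32, 4))               # (8, 1): one pass per step
print(accumulation_rounds(32, 4, batch_gpu=4))  # (4, 2): smaller microbatch, 2 rounds
```

Halving `--batch-gpu` roughly halves the activation memory per pass at the cost of extra accumulation rounds, which is why it is the usual first lever against OOM without changing the effective batch size.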