NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

Very high GPU memory usage #153

Open AlexBlck opened 2 years ago

AlexBlck commented 2 years ago

Hi,

I tried to run the config recommended for MetFaces-U at 1024x1024 resolution, but on my own dataset. On 8xV100 it was running out of memory, so I tried 4xA6000 instead. It turns out it takes ~40GB per GPU, which is quite a bit higher than the reported ~10GB.

The command I'm running: python train.py --outdir=training-runs --cfg=stylegan3-r --data=/home/ubuntu/sg/data/v1 --gpus=4 --batch=32 --gamma=6.6 --mirror=1 --kimg=5000 --snap=5 --metrics=none --resume=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-ffhqu-1024x1024.pkl

And here is my nvidia-smi output: [nvidia-smi screenshot]

On 8xV100 I tried lowering --batch-gpu until it finally ran, but then seven of the GPUs were using ~5GB while the first one was still running out of memory.
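For reference, this is the kind of variant I was trying on the V100s; the --batch-gpu value here is just an example, and as far as I understand lower values trade speed for memory via gradient accumulation:

python train.py --outdir=training-runs --cfg=stylegan3-r --data=/home/ubuntu/sg/data/v1 --gpus=8 --batch=32 --batch-gpu=2 --gamma=6.6 --mirror=1 --kimg=5000 --snap=5 --metrics=none --resume=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-ffhqu-1024x1024.pkl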

Am I doing something wrong?

nurpax commented 2 years ago

Try running train.py with the --nobench option. The cuDNN benchmarking at init time is a memory hog.
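It is a boolean option, so it's just a matter of appending it to the command you posted while keeping everything else the same, e.g.:

python train.py <your existing flags> --nobench=True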

PDillis commented 2 years ago

Also, --cfg=stylegan3-r is the largest config: unless you really need rotational equivariance and don't mind some symmetry artifacts and slower training, I would suggest going with --cfg=stylegan3-t or --cfg=stylegan2 instead.

AlexBlck commented 2 years ago

Try running train.py with the --nobench option

This made it survive long enough to initialize everything, but it still hits OOM right at "Training for 5000 kimg...".

Also, --cfg=stylegan3-r is the largest one

I know it's the largest one, but it should still fit on 8xV100, right? That's the hardware used in the paper, and the reported GPU memory is 10GB per GPU, while mine is twice that. I just tried --cfg=stylegan2 to see how much that would take, and it's still running out of memory. I'm starting to think my data is somehow the problem... which would be strange, since it's the same resolution.

PDillis commented 2 years ago

Yeah, sorry, I didn't read thoroughly at first. There are high upticks in memory at the beginning, but the memory usage per GPU should drop once training starts; the GPU with the highest usage is always the first one. It is bizarre that it doesn't fit into 8 V100s though, even with --cfg=stylegan2. Have you watched the memory usage with e.g. gpustat? I like to use gpustat -cup --watch.
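In case it is not on the machine already, gpustat installs straight from PyPI:

pip install gpustat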

When training, tick 0 has the largest memory usage, and then it goes down to ~10GB per GPU. This is what I get, though with 2 GPUs (A40s) and a 512x512 dataset, so I ended up using --batch=16:

tick 0     kimg 1240.0   time 37s          sec/tick 6.3     sec/kimg 391.28  maintenance 30.3   cpumem 4.79   gpumem 34.92  reserved 39.22  augment 0.000
tick 1     kimg 1244.0   time 6m 08s       sec/tick 327.6   sec/kimg 81.91   maintenance 4.0    cpumem 4.88   gpumem 11.29  reserved 35.05  augment 0.036

gpumem goes from ~35 GB down to ~11 GB. I assume this is because of the custom ops and perhaps the cuDNN benchmark? This could be due to many things, so check both log.txt and training_options.json to make sure everything is behaving nicely and that your command is actually being followed.
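Both of those files end up in the run directory that train.py creates under your --outdir, so something along these lines (the run directory name is a placeholder for whatever your actual run is called):

cat training-runs/<your-run-dir>/training_options.json
tail -n 50 training-runs/<your-run-dir>/log.txt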

arrivabene commented 2 years ago

I'm also training with this configuration on a custom dataset, and what I always see in my traceback is a call to accumulate_gradients from loss.py. Maybe the problem is there?

The first time it happened I was running with --batch=32 and it wouldn't even start; it immediately threw a cuDNN error. I then changed the batch size to 16 and it was able to start training. In the first test I had --snap=2 and it was running relatively fine, but too slow due to the metrics calculation. I changed to --snap=15 and it threw a "RuntimeError: CUDA out of memory" just after tick 5. I then changed to --snap=5 and it was able to run until tick 10, evaluated the metrics, and when it was supposed to start tick 11 it threw the same error.

I'm now trying to run with --nobench=True to see if something changes.

I'm running the model on 4xV100.
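Concretely, the relaunch looks something like this; the data path, config and gamma below are placeholders for my actual values, the rest matches what I described above:

python train.py --outdir=training-runs --cfg=<config> --data=<my-dataset> --gpus=4 --batch=16 --gamma=<gamma> --snap=5 --nobench=True

If it still fails, --metrics=none (as in the command at the top of the thread) seems like the next thing to try, since the crashes line up with the metric evaluations.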

PodoprikhinMaxim commented 2 years ago

I'm also training with this configuration on a custom dataset, and what I always see in my traceback is a call to accumulate_gradients from loss.py. Maybe the problem is there?

So did you find any solution for avoiding the OOM error?

bisraelsen commented 1 year ago

bumping this