NVlabs / stylegan2-ada

StyleGAN2 with adaptive discriminator augmentation (ADA) - Official TensorFlow implementation
https://arxiv.org/abs/2006.06676

Training time is very slow with Nvidia A100 #78

Closed. arunraman closed this issue 3 years ago.

arunraman commented 3 years ago

I trained stylegan2-ada on an NVIDIA A100 with the following CUDA version:

NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0

Even though I was able to train the model, it was very slow: I only got 45-60 sec/kimg and averaged around 170-200 sec/tick, whereas with stylegan2 I got 16.6 sec/kimg and 68.4 sec/tick. I am trying to understand what's causing the slowdown here.

Also, on the A100 with stylegan2-ada, even though I changed the batch size from 64 to 128 by hardcoding it on this line in train.py,

args.minibatch_size = 128

I was not able to use the entire 40 GB of memory. GPU memory utilization gets capped at 18 GB, and I am not able to push it further even though I have another 12 GB available. What parameter other than batch size should I change to fix this?
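If the repo's training loop works the way I understand it, `minibatch_size` is split into per-GPU gradient-accumulation passes of `minibatch_gpu` images each, so raising `minibatch_size` alone only adds accumulation rounds and leaves the per-pass memory footprint unchanged; `minibatch_gpu` is the knob that actually drives memory use. A minimal sketch of that split (names mirror train.py, but the numbers are illustrative):

```python
# Illustrative sketch of how one optimizer step is split in stylegan2-ada.
# minibatch_size: total images per optimizer step (the value hardcoded above).
# minibatch_gpu:  images resident on each GPU per forward/backward pass;
#                 this, not minibatch_size, determines peak memory use.
num_gpus = 1
minibatch_size = 128
minibatch_gpu = 16  # assumed value; the real default depends on the config

# Gradients are accumulated over this many passes before a single update:
rounds = minibatch_size // (minibatch_gpu * num_gpus)
print(rounds)  # -> 8
```

So doubling `minibatch_size` here just doubles `rounds`; to fill more of the 40 GB you would raise `minibatch_gpu` instead.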

johndpope commented 3 years ago

Are you using Docker? There is a fix here: https://github.com/NVlabs/stylegan2-ada/pull/51. If not, I have a branch that uses TensorFlow 2 compatibility mode (it includes other cherry-picked branches): https://github.com/johndpope/stylegan2-ada/tree/digressions
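My understanding of that PR is that it moves the Dockerfile to a CUDA 11 NGC TensorFlow 1.x base image, so the A100's sm_80 kernels are precompiled rather than JIT-compiled from PTX at load time (a common cause of slow A100 throughput with older builds). A rough sketch of running inside such an image; the tag below is illustrative, so check the PR for the exact base image:

```shell
# Run the repo inside an NGC TensorFlow 1.x image built against CUDA 11.
# The image tag is illustrative -- see the linked PR for the actual one used.
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace -w /workspace \
    nvcr.io/nvidia/tensorflow:20.10-tf1-py3
```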

There are also some options to tinker with in the default GPU-based configs: I added configs that maximize GPU usage for 11 GB, 24 GB, and 48 GB cards (use the 11 GB config for 16 GB cards).
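As a rough sketch of what such per-card presets might look like (the names and values here are hypothetical guesses, not the branch's actual settings; the real per-GPU knob in train.py is the per-GPU minibatch):

```python
# Hypothetical per-card presets in the spirit of the configs described above.
# Values are illustrative, not the branch's actual numbers.
GPU_CONFIGS = {
    '11gb': dict(minibatch_gpu=4),
    '24gb': dict(minibatch_gpu=8),
    '48gb': dict(minibatch_gpu=16),
}

def pick_config(vram_gb):
    """Pick the largest preset that fits; 16 GB cards fall back to the 11 GB preset."""
    if vram_gb >= 48:
        return GPU_CONFIGS['48gb']
    if vram_gb >= 24:
        return GPU_CONFIGS['24gb']
    return GPU_CONFIGS['11gb']

print(pick_config(16)['minibatch_gpu'])  # -> 4
```

Under this scheme a 40 GB A100 would land on the 24 GB preset, which is why a larger per-GPU minibatch is needed to use the full card.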

arunraman commented 3 years ago

Closing this as it's obsolete. I will try the PyTorch port when it comes out.