NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

increase the efficiency of GPU usage #213

Open PodoprikhinMaxim opened 1 year ago

PodoprikhinMaxim commented 1 year ago

Can someone explain this? I have a problem with GPU usage during training:

tick 0     kimg 0.0      time 28s          sec/tick 5.0     sec/kimg 1243.95 maintenance 23.5   cpumem 5.17   gpumem 17.68  reserved 20.87  augment 0.000
tick 1     kimg 20.0     time 1h 19m 08s   sec/tick 4711.9  sec/kimg 235.59  maintenance 7.7    cpumem 5.54   gpumem 10.88  reserved 20.23  augment 0.188
tick 2     kimg 40.0     time 2h 37m 04s   sec/tick 4667.8  sec/kimg 233.39  maintenance 7.9    cpumem 4.79   gpumem 10.88  reserved 20.24  augment 0.374
tick 3     kimg 60.0     time 3h 55m 46s   sec/tick 4713.9  sec/kimg 235.69  maintenance 8.1    cpumem 3.33   gpumem 10.99  reserved 20.24  augment 0.543
tick 4     kimg 80.0     time 5h 22m 19s   sec/tick 5185.1  sec/kimg 259.25  maintenance 8.2    cpumem 2.63   gpumem 11.09  reserved 20.25  augment 0.688
tick 5     kimg 100.0    time 6h 48m 03s   sec/tick 5135.8  sec/kimg 256.79  maintenance 8.4    cpumem 2.13   gpumem 11.63  reserved 20.26  augment 0.804
tick 6     kimg 120.0    time 8h 16m 33s   sec/tick 5302.2  sec/kimg 265.11  maintenance 7.7    cpumem 1.73   gpumem 11.10  reserved 20.26  augment 0.901
tick 7     kimg 140.0    time 9h 40m 06s   sec/tick 5004.8  sec/kimg 250.24  maintenance 8.4    cpumem 1.37   gpumem 11.09  reserved 20.27  augment 0.984
tick 8     kimg 160.0    time 10h 59m 36s  sec/tick 4761.6  sec/kimg 238.08  maintenance 8.2    cpumem 1.25   gpumem 11.24  reserved 20.28  augment 1.062
tick 9     kimg 180.0    time 12h 18m 14s  sec/tick 4709.5  sec/kimg 235.47  maintenance 8.1    cpumem 1.23   gpumem 11.25  reserved 20.28  augment 1.146
tick 10    kimg 200.0    time 13h 36m 16s  sec/tick 4674.1  sec/kimg 233.71  maintenance 7.8    cpumem 1.24   gpumem 11.64  reserved 20.29  augment 1.231
tick 11    kimg 220.0    time 14h 54m 19s  sec/tick 4675.1  sec/kimg 233.75  maintenance 7.9    cpumem 1.26   gpumem 11.21  reserved 20.29  augment 1.314
tick 12    kimg 240.0    time 16h 12m 33s  sec/tick 4686.5  sec/kimg 234.33  maintenance 8.2    cpumem 1.26   gpumem 11.14  reserved 20.30  augment 1.392
tick 13    kimg 260.0    time 17h 33m 48s  sec/tick 4866.5  sec/kimg 243.32  maintenance 7.9    cpumem 1.22   gpumem 11.50  reserved 20.30  augment 1.476

At tick 0, GPU memory usage is about 17.7 GB, but it then drops sharply to about 11 GB and stays there. Increasing the batch size doesn't help, because it causes an out-of-memory error at tick 0. Is there any way to fix this?
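To put numbers on the drop, the tick lines can be parsed and compared. The snippet below is a minimal sketch and not part of the StyleGAN3 code: `parse_line` and `NUMERIC_FIELDS` are hypothetical names, the two sample lines are copied from the log above, and it assumes `gpumem` and `reserved` are reported in GB.

```python
import re

# Two lines copied verbatim from the training log above (tick 0 and tick 1).
LOG = """\
tick 0     kimg 0.0      time 28s          sec/tick 5.0     sec/kimg 1243.95 maintenance 23.5   cpumem 5.17   gpumem 17.68  reserved 20.87  augment 0.000
tick 1     kimg 20.0     time 1h 19m 08s   sec/tick 4711.9  sec/kimg 235.59  maintenance 7.7    cpumem 5.54   gpumem 10.88  reserved 20.23  augment 0.188
"""

# Fields followed by a single number. "time" is skipped because its value
# spans several tokens (for example "1h 19m 08s").
NUMERIC_FIELDS = ("tick", "kimg", "sec/tick", "sec/kimg", "maintenance",
                  "cpumem", "gpumem", "reserved", "augment")


def parse_line(line):
    """Return {field: float} for one tick line (hypothetical helper)."""
    row = {}
    for name in NUMERIC_FIELDS:
        # Require start-of-line or whitespace before the name, so that
        # "tick" does not match inside "sec/tick".
        pattern = r"(?:^|\s)" + re.escape(name) + r"\s+(-?\d+(?:\.\d+)?)"
        match = re.search(pattern, line)
        if match:
            row[name] = float(match.group(1))
    return row


rows = [parse_line(line) for line in LOG.splitlines() if line.strip()]
first, steady = rows[0], rows[1]

# Difference between the tick 0 peak and the following tick (assumed GB).
drop_gb = first["gpumem"] - steady["gpumem"]
# Fraction of the reserved pool that is actually in use after tick 0.
used_fraction = steady["gpumem"] / steady["reserved"]

print(f"gpumem drop after tick 0: {drop_gb:.2f}")
print(f"gpumem / reserved at tick 1: {used_fraction:.2f}")
```

The same function can be run over every tick line of a full log file to check whether the tick 0 peak ever recurs later in training.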

DQSSSSS commented 1 year ago

I have run into the same problem. My guess is that the CUDA version is the cause, but I can't confirm it (see https://github.com/autonomousvision/stylegan_xl/issues/90). I use the Docker image pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel.