ajbrock / BigGAN-PyTorch

The author's officially unofficial PyTorch BigGAN implementation.
MIT License

Some guidelines for better GPU performance. #56

Open ysig opened 4 years ago

ysig commented 4 years ago

Hi,

I am trying to train on a custom dataset using your algorithm, increasing the batch size as far as it will go before the script crashes. I am running a script very similar to launch_BigGAN_ch64_bs256x8.sh. I see that roughly twice as much memory is allocated as is actually used (~5 GB utilized vs. ~10 GB allocated). Also, at step 1000 the run crashed because it tried to allocate even more memory (I guess for inference, but why is that?).

I am using 4 Titans with 11 GB each, and although you suggest this may not be enough, I would appreciate any ideas on how to make the best use of this already quite powerful system to train your model.

Thanks a lot for your code and in advance for your time!
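For what it's worth, the gap between memory in use and memory shown by nvidia-smi is usually PyTorch's caching allocator holding on to freed blocks, rather than live tensors. A minimal way to check, assuming a reasonably recent PyTorch (memory_reserved() was called memory_cached() in older versions):

import torch

# Compare live-tensor memory with what the caching allocator has grabbed from
# the driver (the latter is roughly what nvidia-smi reports per process).
for d in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(d) / 1024**3   # memory held by live tensors
    reserved = torch.cuda.memory_reserved(d) / 1024**3     # memory held by the caching allocator
    print(f"cuda:{d}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")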

ysig commented 4 years ago

This is my script:

#!/bin/bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
--dataset I256_hdf5 --parallel --shuffle  --num_workers 8 --batch_size 64  \
--num_G_accumulations 4 --num_D_accumulations 4 \
--num_D_steps 1 --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 \
--G_attn 64 --D_attn 64 \
--G_nl inplace_relu --D_nl inplace_relu \
--SN_eps 1e-6 --BN_eps 1e-5 --adam_eps 1e-6 \
--G_ortho 0.0 \
--G_shared \
--G_init ortho --D_init ortho \
--hier --dim_z 120 --shared_dim 128 \
--G_eval_mode \
--G_ch 64 --D_ch 64 \
--ema --use_ema --ema_start 20000 \
--test_every 2000 --save_every 1000 --num_best_copies 5 --num_save_copies 2 --seed 0 \
--use_multiepoch_sampler \
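For anyone tuning this script for smaller GPUs: per-step activation memory scales with --batch_size, while the effective batch size is batch_size × num_G_accumulations (64 × 4 = 256 here), so lowering --batch_size and raising the accumulation counts should keep the effective batch the same while fitting more easily in 11 GB. A rough sketch of what gradient accumulation amounts to (illustrative only, not the repo's actual training loop):

import torch

# Gradients from several small micro-batches are summed before a single
# optimizer step, so the effective batch size is batch_size * num_accumulations
# while activation memory only scales with the micro-batch size.
def accumulated_step(model, optimizer, loss_fn, micro_batches):
    optimizer.zero_grad()
    for x, y in micro_batches:               # e.g. 4 micro-batches of 64 samples
        loss = loss_fn(model(x), y) / len(micro_batches)
        loss.backward()                       # grads accumulate in the .grad buffers
    optimizer.step()                          # one update for the whole effective batch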
diaodeyi commented 3 years ago

Same question: why does step 1000 need so much memory?
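As far as I can tell, step 1000 coincides with --save_every 1000, where the training loop not only saves weights but also generates sample images with the (EMA) generator, so extra memory is requested on top of what the training step already holds. A minimal sketch of how such a sampling step can be kept cheap, assuming a BigGAN-style generator that takes a latent z and a class embedding (the helper name and signature here are illustrative, not the repo's actual functions):

import torch

# Hypothetical sampling helper: generate a batch of images without building an
# autograd graph, so the periodic save/sample step adds as little memory as
# possible on top of training.
def sample_grid(G, z_dim=120, n_classes=1000, n_samples=64, device="cuda"):
    G.eval()
    with torch.no_grad():                          # no graph -> much smaller footprint
        z = torch.randn(n_samples, z_dim, device=device)
        y = torch.randint(n_classes, (n_samples,), device=device)
        imgs = G(z, G.shared(y))                   # shared class embedding (assumption)
    G.train()
    torch.cuda.empty_cache()                       # hand cached blocks back to the driver
    return imgs.cpu()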