Here is my training history on CelebA-HQ 256x256 (256_shortcut1_inject1_none_hq):
During the first several hundred steps, d_loss was very small (negative) while g_loss started from a very large value (positive). They appear to converge at about 5k steps. I trained this model (the one you can download from Google Drive) with the default setting in the README.
CUDA_VISIBLE_DEVICES=0 python3 train.py --data CelebA-HQ --img_size 256 --shortcut_layers 1 --inject_layers 1 --experiment_name 256_shortcut1_inject1_none_hq --gpu
What is your training setting? What does your training history look like?
Thank you. I found that the loss divergence was caused by the use of multiple GPUs. If I only use one GPU, training proceeds correctly. I think I might be having problems with DataParallel.
Thank you, too. You reminded me of the multi-GPU problem, which I also encountered recently.
Briefly speaking, the autograd graph used to compute the gradient penalty is accidentally freed when the function gradient_penalty() returns. This bug has existed since the PyTorch 1.0.0 release. See this issue.
You can try multi-GPU training with the code I just fixed.
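For context, here is a minimal sketch of a WGAN-GP style gradient penalty of the kind affected by that issue. The names D, real, and fake are placeholders I made up, not the repository's actual variables, and I haven't verified the exact fix applied here; the point is that the penalty relies on a second-order autograd graph (create_graph=True), which under nn.DataParallel can be built on per-GPU replicas and silently freed once the function returns, so the penalty stops contributing gradients.

import torch
import torch.autograd as autograd

def gradient_penalty(D, real, fake):
    # Interpolate between real and fake samples.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    score = D(interp)
    # create_graph=True keeps the graph so the penalty term itself
    # can be backpropagated through when the total D loss calls backward().
    grad = autograd.grad(outputs=score, inputs=interp,
                         grad_outputs=torch.ones_like(score),
                         create_graph=True, retain_graph=True)[0]
    grad = grad.view(grad.size(0), -1)
    return ((grad.norm(2, dim=1) - 1) ** 2).mean()

On a single GPU this works as expected; the multi-GPU failure mode is specific to how DataParallel replicas are released, as discussed in the linked PyTorch issue.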
Your answer has helped me a lot, thank you again for your reply.
You're welcome ^^
When I train AttGAN on CelebA-HQ, I find the losses are very strange, for example: 39%|▍| 684/1750 [26:59<39:02, 2.20s/it, d_loss=3.62e+10, epoch=0, g_loss=5.11e+5, iter=684]. Visualizing the curves, df_gp, df_loss, g_loss, and gf_loss are all very high and still increasing. Did you re-adjust the coefficients of each loss? Or what tricks did you use?