LMescheder / GAN_stability

Code for paper "Which Training Methods for GANs do actually Converge? (ICML 2018)"
MIT License
918 stars 114 forks

Why multiply by 0.1 in the residual block? #11

Open zplizzi opened 5 years ago

zplizzi commented 5 years ago

In the paper and code (e.g. here), the output of the resnet blocks is multiplied by 0.1. I'm curious about the purpose of this. Does it have to do with the absence of batch norm?

LMescheder commented 5 years ago

It essentially just reduces the effective learning rate for those blocks by a factor of 10 (due to the adaptive optimizer, RMSProp). We haven't experimented with it much, and I think it might also work fine without the 0.1.
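For context, here is a minimal sketch of a residual block with this kind of output scaling. The class and layer names are illustrative assumptions, not the repo's exact ResnetBlock:

```python
import torch.nn as nn
import torch.nn.functional as F

class ScaledResnetBlock(nn.Module):
    """Illustrative residual block whose residual branch is scaled by 0.1."""

    def __init__(self, channels):
        super().__init__()
        self.conv_0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_1 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        dx = self.conv_0(F.leaky_relu(x, 0.2))
        dx = self.conv_1(F.leaky_relu(dx, 0.2))
        # The 0.1 shrinks the residual branch; with an adaptive optimizer like
        # RMSProp this acts roughly like a 10x smaller learning rate for these
        # layers, and it also keeps the block close to the identity at init.
        return x + 0.1 * dx
```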

LuChengTHU commented 5 years ago

I removed the 0.1 factor and changed g_lr and d_lr from 1e-4 to 1e-5, but training does not converge at all. I don't know the reason.

LMescheder commented 5 years ago

> I removed the 0.1 factor and changed g_lr and d_lr from 1e-4 to 1e-5, but training does not converge at all. I don't know the reason.

Thanks for reporting your experimental results. Which architecture and dataset did you use? I quickly tried celebA and LSUN churches at resolution 128^2, and there it appears to work fine without the 0.1 and with a lr of 1e-5. One possible reason it did not work for you is that the 0.1 also changes the initialization, which can be quite important (for deep learning in general and for our method in particular, as it only has local guarantees). What you can try is to add

# zero-initialize the block's last conv so the block starts as the identity mapping
nn.init.zeros_(self.conv_1.weight)
nn.init.zeros_(self.conv_1.bias)

to the __init__ function of the ResNet blocks when removing the 0.1, and set both learning rates to 1e-5.
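For reference, a hedged sketch of what such a block could look like with the 0.1 removed and the zero-init added (again, the names and layout are illustrative, not the exact code in gan_training/models/resnet.py):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResnetBlockNoScale(nn.Module):
    """Sketch: residual block without the 0.1 factor, relying on zero-init instead."""

    def __init__(self, channels):
        super().__init__()
        self.conv_0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_1 = nn.Conv2d(channels, channels, 3, padding=1)
        # Zero-init the last conv so the block is exactly the identity at
        # initialization, recovering the near-identity start the 0.1 provided.
        nn.init.zeros_(self.conv_1.weight)
        nn.init.zeros_(self.conv_1.bias)

    def forward(self, x):
        dx = self.conv_0(F.leaky_relu(x, 0.2))
        dx = self.conv_1(F.leaky_relu(dx, 0.2))
        return x + dx  # no 0.1 scaling; pair this with the smaller 1e-5 learning rates
```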

LuChengTHU commented 5 years ago

Thanks for your reply! I used celebA-HQ with an image size of 1024×1024. I only changed the lr in configs/celebA-HQ and removed the 0.1 factor in gan_training/models/resnet.py. I will try the initialization change. Thanks!

zplizzi commented 5 years ago

The 0.1 factor made more sense to me after reading the Fixup paper: it explains why standard initialization methods are poorly suited for ResNets and can cause immediate gradient explosion. The 0.1 factor is a rough approximation of the fix they suggest, which is to down-scale the initializations inside the residual blocks and, potentially, to initialize the last conv layer of each block to 0 (as @LMescheder mentions above), along with a few other changes.
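As a rough illustration (not the Fixup paper's exact prescription, and the block attribute names are hypothetical), a Fixup-style initialization for two-conv residual branches could look like this:

```python
import torch
import torch.nn as nn

def fixup_like_init(blocks):
    """Fixup-style init sketch for residual blocks with two convs per branch.

    Assumes each block exposes conv_0 (first conv) and conv_1 (last conv);
    these attribute names are assumptions for this sketch, not the repo's code.
    """
    num_blocks = len(blocks)
    scale = num_blocks ** (-0.5)  # L^{-1/(2m-2)} with m = 2 layers per branch
    with torch.no_grad():
        for block in blocks:
            # Down-scale the first conv's standard init so deep stacks of
            # residual branches don't blow up the activations at the start.
            nn.init.kaiming_normal_(block.conv_0.weight, nonlinearity='relu')
            block.conv_0.weight.mul_(scale)
            # Zero the last conv so every residual branch starts at exactly zero.
            nn.init.zeros_(block.conv_1.weight)
            if block.conv_1.bias is not None:
                nn.init.zeros_(block.conv_1.bias)
```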