Hi, thanks for reaching out! Yes, the difference very likely comes down to the number of training steps. We trained all models for the same number of steps (exposing them to the same number of images) and reported results based on that, so the VQGAN model seems to need many more steps than the StyleGAN family of models to converge.
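(As a quick back-of-the-envelope illustration, with a fixed batch size the step count and the number of images shown are interchangeable; the batch size and budget below are assumed example values, not the paper's actual settings.)

```python
# Assumed example values, not the actual settings from the paper.
batch_size = 32
total_kimg = 10_000                       # budget in thousands of real images shown
steps = total_kimg * 1000 // batch_size   # equivalent number of training steps
print(steps)                              # 312500
```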
Also, to provide more details about the comparison: note that all other models (including the other baselines and the GANsformer) were implemented on top of the official StyleGAN2 codebase to make sure they all share the same training scheme and fully identical optimization details. For VQGAN, because it differs more substantially from the unconditional GANs, we used the original official implementation to train it.
Ah, I see.
Does the GANsformer reach lower FID values if you train it longer? It looks like it has converged, and the compute budget is just enough for the GANsformer to converge but not for the other methods -- and that is the point at which you compare.
Since this issue has already been raised elsewhere, why not report the values at convergence for the other methods as well? Even if you do not train them yourself, the authors of the baselines provide these numbers, or even pre-trained models, for some of the datasets. Right now, the reported results are somewhat misleading, even though you explicitly state that the compute budget is restricted.
Hi, thank you for the question! Both the GANsformer and the other baseline models, including the vanilla GAN, StyleGAN and SAGAN (and, for CLEVR, also kGAN), in fact showed converging behavior in our reported experiments, as you can see on page 9 of the paper and in the convergence plot attached to this comment.
It is known, both from our experience training these models and from the reports in the StyleGAN1 and StyleGAN2 papers, that they keep improving very slowly long after approaching their FID convergence regime (e.g. possibly even for 10x more steps).
Indeed, the same is also true for the GANsformer model. As we indicated in the readme, we are currently continuing to train it, and it already shows better scores across multiple datasets; we plan to release these new checkpoints soon after training it further.
Since we train the models under the same training conditions and for the same number of training steps, since they reached convergence behavior as determined by a common convergence criterion (relative improvement over a given number of further training steps, as discussed here), and since we explicitly state the number of training steps, experimental settings, etc. in the paper, we believe the comparison is fair and correct.
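For concreteness, one way to express that kind of criterion is a relative-improvement check on the recorded FID curve. The sketch below is purely illustrative: the `has_converged` helper, the 5-evaluation window, and the 1% threshold are assumptions for the example, not the exact procedure used in the paper.

```python
def has_converged(fid_history, window=5, rel_threshold=0.01):
    """Return True if the best FID over the last `window` evaluations
    improved by less than `rel_threshold` (relative) compared to the
    value recorded `window` evaluations earlier."""
    if len(fid_history) <= window:
        return False  # not enough history to judge convergence yet
    past = fid_history[-window - 1]       # FID recorded `window` evaluations ago
    current = min(fid_history[-window:])  # best FID since then
    rel_improvement = (past - current) / past
    return rel_improvement < rel_threshold


# Example: a run whose FID curve has flattened out around ~9.2
fids = [45.0, 30.1, 22.4, 17.8, 14.9, 13.0, 11.8, 11.0, 10.4,
        9.9, 9.6, 9.4, 9.3, 9.25, 9.23, 9.22, 9.21, 9.20, 9.20]
print(has_converged(fids))  # True: < 1% relative improvement over the last 5 evals
```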
We also believe that learning and data efficiency are important additional desirable properties of deep learning models, and so we hope this comparison will encourage researchers to pursue not just the high-end limit of training lengths and model sizes, which may only be attainable by large corporations, but also the more reasonable, and potentially more important, ranges of compute time and resources.
Hope it helps!
Thank you for the insights :+1:
I definitely agree with your last point that "this comparison will encourage researchers not just to pursue the high-end limit of training lengths and model sizes". A comparison with a fixed compute budget makes a lot of sense and will be helpful for other researchers... and for keeping deadlines ;). Maybe explicitly calling it a "fixed-compute benchmark" would help?
I'm not trying to review here; I was just confused by the numbers, and I think reporting the official numbers in the unrestricted-compute limit, where available, would not lessen your results.
Btw, if you get around to it, I think many would appreciate an official PyTorch version.
Hi, sure, no worries at all, I'm happy to discuss! I will be happy to mention that.
I also totally agree that showing results in the limit will be great and helpful; the GPUs in my lab are busy crunching numbers to train the models further as we speak :)
I agree an official PyTorch implementation would be wonderful too. I'm working on a new project for NeurIPS and will be glad to do that right afterwards!
Awesome. Thanks for your answers and good luck on your NeurIPS project :)
Thanks a lot!
Thank you for open-sourcing your code :)
I was wondering about the generally very high FID values for VQGAN. In the VQGAN paper, they report an FID of 11.4 on, e.g., FFHQ 256x256, whereas you report 63.1... Any idea why they are so different?
Thanks!