POSTECH-CVLab / PyTorch-StudioGAN

StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

ContraGAN no improvement at all #45

Closed · curiousbyte19 closed this issue 3 years ago

curiousbyte19 commented 3 years ago

I am not getting any improvement from ContraGAN at all. For the main contribution, the FID improvement is not even 0.1. If I rerun the FID evaluation, I often get a worse score; the same goes for IS. The other metrics do not improve either, as your own table shows. Including other methods like DiffAug is not fair, since BigGAN can benefit from those methods too.

The FID score can change by a lot more than 0.1, sometimes by more than 1 point, if you run it many times. How many times did you train the model and compute the FID score?
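
For example, a minimal sketch of what I mean (this uses torchmetrics' FID, not your evaluation code, and random tensors stand in for real and generated images) shows the run-to-run spread when the fake set is re-sampled:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Sketch only (torchmetrics' FID, not StudioGAN's own evaluation code): random
# uint8 tensors stand in for real images and generator samples. The point is
# that re-sampling the fake set changes the FID from run to run.
real = torch.randint(0, 256, (256, 3, 32, 32), dtype=torch.uint8)

scores = []
for run in range(3):
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real, real=True)
    fake = torch.randint(0, 256, (256, 3, 32, 32), dtype=torch.uint8)  # in practice: G(z) with fresh z
    fid.update(fake, real=False)
    scores.append(fid.compute().item())

print(scores)  # the spread across these runs is the "error range" in question
```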

For ImageNet 128x128, the ContraGAN score is even worse than the BigGAN baseline. Since ContraGAN is built on BigGAN, the result is either no improvement or worse performance.

I want to use your work, but it is hard to trust the scores now. Can you help explain? Thanks.

Saci46 commented 3 years ago

Same here

mingukkang commented 3 years ago

Hello,

Thank you so much for asking a really good question:)

I think I should provide a detailed explanation of the discrepancies between the tables in our paper and those in the GitHub repo.

I will answer the above questions as quickly as possible.

Thank you.

Best,

Minguk

mingukkang commented 3 years ago

Hi, curiousbyte19

Before starting, let me briefly explain a chronic problem in evaluating GAN performance, and why the numbers in our paper and in StudioGAN differ from each other.

First of all, it is necessary to know that the way a GAN is evaluated changes the results. As mentioned in Appendix B of the paper [C4], it is a known fact that Batch Normalization [C5] causes an inconsistency between the training and test phases: in training mode it computes batch statistics on the fly, while in evaluation mode it uses moving-average statistics.

This causes inconsistencies across published papers. For instance, the original implementations of DCGAN [C6] and SAGAN [C7] evaluate performance in training mode to improve the generation quality, even though the models are in the test phase. Therefore, it is important to define which mode is used for evaluation; let me call this the evaluation protocol.
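
For readers who are unfamiliar with this, here is a minimal PyTorch sketch (a toy generator, not the StudioGAN one) showing why the choice of mode matters: the same weights and the same latent codes produce different samples, and therefore different IS/FID values, depending on whether BatchNorm uses batch statistics or moving averages.

```python
import torch
import torch.nn as nn

# Toy generator block with BatchNorm, to illustrate why the evaluation
# protocol (train vs. eval mode) changes the samples the metrics see.
toy_generator = nn.Sequential(
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),   # batch statistics in train mode, running stats in eval mode
    nn.ReLU(),
    nn.Linear(256, 3 * 32 * 32),
    nn.Tanh(),
)

z = torch.randn(64, 128)

toy_generator.train()      # DCGAN/SAGAN-style evaluation protocol
with torch.no_grad():
    samples_train_mode = toy_generator(z)

toy_generator.eval()       # protocol using moving-average statistics
with torch.no_grad():
    samples_eval_mode = toy_generator(z)

# Same weights, same z, different outputs -> different IS/FID.
print((samples_train_mode - samples_eval_mode).abs().mean())
```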

When we designed the experiments for our ContraGAN paper, we used the same evaluation protocol as DCGAN and SAGAN, to stay consistent with the published works. Interestingly, BigGAN with this protocol showed a large performance enhancement on the CIFAR10 and Tiny ImageNet datasets. Please note that BigGAN evaluated in eval mode gave worse FID values (I don't remember the exact numbers, but they were close to 12 ~ 13).

After the paper was accepted, we improved our implementation and took some lessons from DiffAugGAN (NeurIPS 2020; concurrent with ContraGAN) [C8] to configure BigGAN better. We updated the hyperparameters of BigGAN, advanced our implementation, and reported the scores on the StudioGAN GitHub page using various evaluation protocols for each dataset. These attempts aim to provide models with better evaluation results, which is why the BigGAN results in StudioGAN got better. We state the evaluation protocol for each experiment on the GitHub page.

That is a long background. Finally, here are my answers to your questions:

1. I am not getting any improvement from ContraGAN at all.

On CIFAR10, ContraGAN (FID 8.065) shows better results than the original BigGAN [C1] (the FID reported in the BigGAN paper is 14.73). At the time of submission, we reported the performance of our re-implementation of BigGAN (FID 10.7).

After the release of StudioGAN, our BigGAN re-implementation (FID 8.034) got even better, but it is still comparable to ContraGAN (FID 8.065). To the best of our knowledge, the BigGAN implementation in StudioGAN shows the best performance of any BigGAN implementation on CIFAR10.

We also show that ContraGAN gives enhanced performance on the Tiny ImageNet dataset.

2. Including other methods like DiffAug is not fair, since BigGAN can benefit from those methods too.

I am not sure I understand the question correctly. We report the performance of BigGAN with DiffAug as well as ContraGAN with DiffAug. Since CR, ICR, DiffAug, and ADA are detachable regularization methods, we can apply these regularizations to any GAN framework. StudioGAN reports results for GANs with each of these regularizations; please refer to the benchmark table in StudioGAN.
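
As a rough sketch of what "detachable" means here (the `diff_augment` below is a hypothetical one-transform stand-in, not StudioGAN's or the DiffAug authors' implementation), the regularization only wraps the images fed to the discriminator, so it can be attached to BigGAN, ContraGAN, or any other framework:

```python
import torch
import torch.nn as nn

def diff_augment(x):
    # Hypothetical stand-in for a differentiable augmentation policy:
    # only a brightness jitter is shown; the real DiffAug also uses
    # color, translation, and cutout transforms.
    return x + (torch.rand(x.size(0), 1, 1, 1) - 0.5)

# Any discriminator can be wrapped, because the augmentation only touches the
# inputs to D and leaves the generator and the adversarial loss unchanged.
D = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))

real = torch.rand(8, 3, 32, 32)
fake = torch.rand(8, 3, 32, 32)   # in practice: G(z)

real_logits = D(diff_augment(real))
fake_logits = D(diff_augment(fake))
d_loss = torch.relu(1.0 - real_logits).mean() + torch.relu(1.0 + fake_logits).mean()  # hinge loss
print(d_loss.item())
```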

3. How many times did you train the model and compute the FID score?

I trained the models three times to report the numbers in the paper, and once for StudioGAN.

I hope this answer helps. Feel free to comment if you have any questions.

References
[C1] A. Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR 2019.
[C2] A. Siarohin et al., Whitening and Coloring Batch Transform for GANs. ICLR 2019.
[C3] T. Miyato and M. Koyama, cGANs with Projection Discriminator. ICLR 2018.
[C4] A. Brock et al., High-Performance Large-Scale Image Recognition Without Normalization. arXiv:2102.06171, 2021.
[C5] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. PMLR, 2015.
[C6] A. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434, 2016.
[C7] H. Zhang et al., Self-Attention Generative Adversarial Networks. ICML 2019.
[C8] S. Zhao et al., Differentiable Augmentation for Data-Efficient GAN Training. arXiv:2006.10738, 2020.

mingukkang commented 3 years ago

Backup


BigGAN consists of two networks: a generator with conditional batch normalization layers [C2] and a discriminator with a projection operation [C3] to inject label information. ContraGAN has the same generator architecture as BigGAN, but there is no projection operation in its discriminator; instead, the projection is replaced with conditional contrastive learning.
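
To make the architectural difference concrete, here is a minimal sketch (illustrative code, not the StudioGAN modules) of the projection-discriminator output that BigGAN uses and that ContraGAN removes; ContraGAN instead trains the discriminator features with a conditional contrastive (2C) loss against class embeddings.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Minimal sketch of the projection-discriminator output [C3]:
    D(x, y) = psi(phi(x)) + <embed(y), phi(x)>, where phi(x) are features from
    the discriminator backbone. ContraGAN drops this inner-product term and
    trains phi(x) with a conditional contrastive loss instead."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)               # psi: unconditional logit
        self.embed = nn.Embedding(num_classes, feat_dim)   # class embeddings

    def forward(self, features, labels):
        uncond = self.linear(features)                                     # psi(phi(x))
        proj = (self.embed(labels) * features).sum(dim=1, keepdim=True)    # <e_y, phi(x)>
        return uncond + proj

head = ProjectionHead(feat_dim=512, num_classes=10)
feats = torch.randn(4, 512)           # phi(x): discriminator features
labels = torch.randint(0, 10, (4,))
print(head(feats, labels).shape)      # torch.Size([4, 1])
```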

Baran-phys commented 3 years ago

What do you think is causing the performance difference between StudioGAN's implementation of BigGAN and ajbrock's implementation? Specifically, did you add or remove any layers/parts?

mingukkang commented 3 years ago

Hi, Hosein:)

After checking the details for accurate reporting, I will add a description to the README.

Thank you.

curiousbyte19 commented 3 years ago

Thanks for answering.

I understand the background, but that is not the problem. The main problem is that either your "improvement" is smaller than the evaluation error range, or the scores are worse than what you report in the paper, so it is hard to know whether your work really improves anything. That is the main problem, and it is something you did not answer.

The main problem can be summarized like this: if your main contribution is a 0.1 FID difference that is smaller than the error range, or does not even improve over BigGAN, which is your base model, how can you say that ContraGAN "outperforms state-of-the-art models", as your paper claims?

I address your points below.

Topic 1

No offense, but your answer was not honest. When you say the BigGAN FID was 10.7 at the time of submission, the score for ContraGAN is 10.6 in the same paper, for the same dataset and the same evaluation protocol. Why compare scores across different evaluation protocols when you already said the scores will not be consistent?

Let me be clear on the scores so that everyone can judge your work accurately. Your paper quotes the CIFAR10 scores (Table 3):

The error range is larger than the improvement. How can we know there is a real improvement?

Now you say your StudioGAN result is better. OK, but let's compare using the same method. Now we get these scores:

There is no improvement. I suggest you not mix and match the scores just to make them look better; that is not honest and wastes time. Your new scores also have no error range. When I run the FID myself, the result is worse than the BigGAN score. I can run FID on the BigGAN model again and again and get a score more than 0.1 better than its own previous score.

So let's be clear on CIFAR10: the results from StudioGAN and your paper both show at most a 0.1 improvement, which is far from statistically significant enough to claim state-of-the-art, if we can simply rerun FID on the same model and get a better score without using your main contribution.

OK, now you say that Tiny ImageNet shows an improvement. But your Table 3 already shows there was almost a 4-point error range in the BigGAN score:

So we have the same problem again: how can you say your work improves anything when the improvement is many times smaller than the error range?

But even Tiny ImageNet is not that important, because you already have results for ImageNet 128, so let's focus on that. Your reply also does not mention ImageNet 128.

The big dataset is ImageNet 128, and it is the most important one. Your paper claims a 7.7% improvement (Table 3):

Is this from only one run? Your StudioGAN scores say something very different, and again there is no improvement. Your README table shows:

This is a worse score than BigGAN, comparing the same batch size from your README table.

So does your paper wrongly claim a 7.7% improvement? Where is the GAN model with this improvement? Did you run everything with the same hyperparameters? This is one of the big claims of your paper, but it is not reproduced.

Also, you say that you improved the hyperparameters in StudioGAN to improve the implementation. But for ImageNet 128 the scores from your code are worse than those in your paper. Why?

Topic 2

What I mean in my original post is comparing your main contribution only, not anything related to DiffAug. That means no mixing and matching, and I compare:

For CIFAR10, the README table shows:

No improvement here either. But this topic is not my main concern, and I accept your explanation. Thank you.

Topic 3

If you trained the model 3 times to get the numbers, why train only once for StudioGAN? Why not give the error range from the 3 runs?

Your repo is supposed to reproduce your paper, but now the results do not match it. The results are in fact worse than what the paper reported.

Conclusion

I spent a lot of time trying to follow your work from the paper, but when I came to StudioGAN I could not reproduce the scores from your paper. Instead, I am now convinced the work has no real improvement.

My problem is that I spent so much time thinking it was my fault because I was running something wrong. But my conclusion is that I was misled by the results in your paper, and in truth there is no improvement. Can you tell me why I am wrong?

If you know the model has no real improvement, why do you want to mix and match scores to give your paper more credibility? This is not productive for anyone trying to do research on this topic.

I hope you can answer my questions without trying to mix the scores again.


All the scores are from your own paper or your README table. https://arxiv.org/pdf/2006.12681.pdf

mingukkang commented 3 years ago

I read your comment several times and tried to understand what your point is.

I think the misunderstanding comes from two sentences in our paper: "For a fair comparison, we re-implement twelve state-of-the-art GANs using the PyTorch library. The software package is available at https://github.com/POSTECH-CVLab/PyTorch-StudioGAN" and "As shown in Table 2, our approach shows favorable performance in CIFAR10, but our approach exhibits larger variances". The most recent release of StudioGAN does not fully reflect the state of the code with which we conducted all the experiments for our paper.

StudioGAN was released 9 months ago, and I have changed details to enhance the baselines using lessons from papers and exhaustive trials. Also, in our paper we said that ContraGAN shows "favorable" generation results and exhibits a large variance on CIFAR10. ContraGAN shows a large enhancement compared to the original BigGAN.

Now, let me answer the following question: "Why compare scores across different evaluation protocols when you already said the scores will not be consistent?"

I think this is the most confusing part. In our paper, I tried to evaluate the various GANs as consistently as possible to report reliable results. However, in StudioGAN we have reported each algorithm's best result by changing the evaluation mode and the training hyperparameters. We wrote this down clearly in the README.

Then, I think you may have two questions. First, why don't you conduct the experiments three times when reporting the tables in StudioGAN? Second, why are the ImageNet 128 results worse than the numbers in our paper?

The answer to the first question is that it takes too much time to conduct all the experiments three times. I spent three months finishing all the experiments in StudioGAN.

The answer to the second question is that the hyperparameter setups are different. We used TTUR + BigGAN + 250K iterations and TTUR + ContraGAN + 250K iterations for our paper. However, in StudioGAN we used the original BigGAN paper's hyperparameter setup (only with a reduced batch size). This is because we wanted to show BigGAN's and ContraGAN's results without TTUR training.
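
For context, TTUR simply means giving the generator and the discriminator optimizers with different learning rates. A minimal sketch is below (the 1e-4 / 4e-4 values are the common choice from the TTUR paper, used only as an illustration and not necessarily the exact settings of these runs):

```python
import torch
import torch.nn as nn

# Placeholders for the real networks; only the optimizer setup matters here.
G = nn.Linear(128, 3 * 128 * 128)
D = nn.Linear(3 * 128 * 128, 1)

# TTUR: two time-scale update rule, i.e. different learning rates for G and D,
# with D typically updated on the faster time scale.
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.999))
```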

Lastly, I attach the pre-trained ContraGAN + TTUR + 250K model on ImageNet and its evaluation log here (you can evaluate it directly using StudioGAN).

checkpoint contra_biggan_imagenet128_hinge_no-train-2020_08_08_18_45_52.log

curiousbyte19 commented 3 years ago

Thanks for explaining. If I now understand your reply, you are saying that the biggest difference comes from the different hyperparameters used in StudioGAN vs. your paper. So the conclusion is:

For the question on training three times, what I meant was: why do you report a single model's score when you trained three times for the paper? As your Tiny ImageNet results show, the FID can change by 4 across runs. Are the other ImageNet 128 scores worse than 19.4? Not including the error range is strange, because in Table 3 you do include the error range for the other datasets.

Because if StudioGAN shows no improvement without TTUR, does the original paper also show no improvement, with the good score in the paper coming only from running FID a single time?

Can you include the logs and checkpoints of the other two ContraGAN models you trained for your paper? Thank you.

mingukkang commented 3 years ago

Hi, curiousbyte19

The ImageNet experiment was conducted only once, since it takes about 10 days to finish the training using 4 TITAN RTX GPUs. This is written in Appendix B of our paper, so I don't have additional pre-trained ContraGANs on ImageNet. This is also why we used TTUR training: TTUR is known to stabilize GAN training, so we applied it to evaluate GANs through a single trial.

Now I think I understand what your concerns are.

Training GANs is hard and unstable. Even with the same hyperparameters as before, the dynamics between the generator and discriminator can become trivial and meaningless, or can converge to a different local equilibrium.

From the next work on, I will try to conduct multiple experiments for more reliable evaluation.

I will also add the implementation differences between CompareGAN (from Google), BigGAN_PyTorch (from Brock et al.), and StudioGAN.

It will take some time to finish.

Thank you so much.

Best,

Minguk

Saci46 commented 3 years ago

Thank you for explaining this!
